This edition of the Jargon buster is follow-up from last month's, looking at how technology can help make sense of large amounts of data. Last time out we looked at data science and how patterns and key information are quickly identified by technology. In this edition we define how different types of computer and programming languages – SQL, Python and R – are used in database management systems and for statistical computing. Also, the output of this as we look at what software does with the data – ETL (Extract, transform and load) – and tools that visualise the findings for presentations and reports.
Also pronounced as "sequel", SQL (Structured Query Language) is a type of declarative programming language that is used to communicate with relational databases. Some common relational database management systems that use SQL are Oracle, Microsoft Azure SQL Database using a version of PostgreSQL and MySQL among other open source systems.
There are several language elements in one SQL statements, whose order different from the order of execution. It is quite easy to learn, for example one would use "GET" keyword to retrieve data and use "UPDATE" to update a record.
R is a programming language and software environment for statistical computing. The R language is widely used among academics, statisticians and data miners for developing statistical software and data analysis. Designed with dealing with data in mind, it includes a suite of built in operators, tools and graphical facilities for data analysis and visualisation.
Python is another programming language allows data analysis. It is also one of the first languages people first learn to code because the syntax is designed to be simple and to mimic the English language.
Although Python may not have as many libraries as R for data analysis, and require more rigorous testing, its flexibility and extensibility as a general purpose language allows developers to build data science pipelines that include machine learning processing and web framework at scale. This makes Python an essential tool in the data science toolkit.
Data visualisation, the graphic representation of data, is usually one of the reasons why people join the data science field. Everyone wants to make that nice graphic presenting boring data in a cool medium to the public. However, this is easily said than done as it could be hard to identify a meaningful pattern for display. Designers also need to use their judgement carefully to avoid presenting misleading visualisation could convey inaccurate messages.
Some of the most popular visualisation tools today are Tableau, D3.js and PowerBI.
Extract, transform and load are the general procedure of copying data from one or more sources into a destination system that has a different way of representing data from the source system.
Extraction involves extracting data from one or more sources. Data transformation involves cleansing data, restructuring and reformatting data so that it is suitable for querying and analysis. Finally loading refers to the process of inserting data to the target database usually a data warehouse.
This is a common process that ensures data quality and consistency standards so that the warehouse stores data sustainably and presentations can be made without much data preprocessing.