In this edition of our monthly Jargon buster series, we define five terms from the field of data science, all relating to how technology helps us make sense of large amounts of data. Using statistical models and neural networks, technology can extract information and analyse data, drawing key conclusions from an otherwise unreadable haystack far more efficiently.
Identifying patterns and key information quickly helps companies reach data-backed decisions faster, which is why many companies now have data science roles. Clifford Chance also has a data science team, which helps with data extraction and natural language processing.
Data Science
Data science is a multi-disciplinary field that uses statistics, data analysis, machine learning and related methods to extract knowledge and insights from structured and unstructured data. A typical data science life cycle involves five stages: capturing the right data in a meaningful way; maintaining the data by cleansing it and setting up data architectures; processing the data through clustering, modelling, summarising and similar techniques; analysing it using methods such as predictive analysis; and finally communicating the results with visualisation tools so that business decisions can be reached.
Text Mining
Text mining is the process of deriving high-quality information from text. It identifies trends that humans may not easily spot, through means such as statistical pattern learning. Typical text mining tasks include text categorisation, clustering, concept extraction, sentiment analysis and summarisation. These tasks are usually lined up in a pipeline, so that the most valuable information and insights emerge at the end of the process.
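To make the idea of a pipeline concrete, here is a minimal sketch in Python that chains three simple steps: splitting text into words, removing common filler words and assigning a category. The stop-word list, the category keywords and all function names are hypothetical and chosen purely for illustration. Real text mining systems typically rely on statistical or machine-learned models rather than hand-written keyword lists.

```python
import re
from collections import Counter

# Hypothetical stop words and category keywords, for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "is", "in", "for", "this", "that"}
CATEGORY_KEYWORDS = {
    "contract": {"agreement", "clause", "party", "termination"},
    "finance": {"loan", "interest", "payment", "lender"},
}


def tokenise(text):
    """Step 1: lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Step 2: drop very common words that carry little meaning."""
    return [t for t in tokens if t not in STOP_WORDS]


def categorise(tokens):
    """Step 3: pick the category whose keywords occur most often.

    Returns None when no keyword from any category appears.
    """
    counts = Counter(tokens)
    scores = {
        category: sum(counts[word] for word in words)
        for category, words in CATEGORY_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def mine(text):
    """Run the steps in order and return the most frequent terms and a category."""
    tokens = remove_stop_words(tokenise(text))
    return {
        "top_terms": Counter(tokens).most_common(3),
        "category": categorise(tokens),
    }


if __name__ == "__main__":
    print(mine("The lender agreed the loan. Interest on the loan is due with each payment."))
```

Each step takes the output of the previous one, so further stages (for example summarisation) could be added to the chain without changing the earlier ones.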
Metadata
Metadata is data that describes other data, such as the data you are analysing. In practical terms, on the web it often takes the form of keywords, descriptions or page titles, which are added to improve the page's search engine optimisation (SEO).
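As an illustration of webpage metadata, the sketch below uses Python's standard `html.parser` module to pull the title, description and keywords out of a page. The sample page and the names `MetadataParser` and `extract_metadata` are assumptions made for this example and do not come from any particular tool.

```python
from html.parser import HTMLParser


class MetadataParser(HTMLParser):
    """Collects the <title> text and <meta name="..." content="..."> pairs."""

    def __init__(self):
        super().__init__()
        self.metadata = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            name = attributes.get("name")
            content = attributes.get("content")
            # Only keep tags that carry both a name and a content value.
            if name and content is not None:
                self.metadata[name.lower()] = content

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Text between <title> and </title> may arrive in several pieces.
        if self._in_title:
            self.metadata["title"] = self.metadata.get("title", "") + data


def extract_metadata(html):
    parser = MetadataParser()
    parser.feed(html)
    parser.close()
    return parser.metadata


# Hypothetical sample page, for illustration only.
SAMPLE_PAGE = """<html><head>
<title>Jargon buster: data science</title>
<meta name="description" content="Five data science terms explained.">
<meta name="keywords" content="data science, text mining, OCR">
</head><body><p>Page text.</p></body></html>"""

if __name__ == "__main__":
    for key, value in extract_metadata(SAMPLE_PAGE).items():
        print(f"{key}: {value}")
```

The visible page text ("Page text.") is the data itself, whereas the title, description and keywords describe it, which is what makes them metadata.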
Optical Character Recognition
OCR is the technology and process of converting text from an image format into a format that a computer can manipulate. In other words, it makes text documents editable and searchable. The software works by recognising the characters in the image and outputting them as a text document. This is crucial for long legal documents or documents containing large amounts of data, because it allows users to edit and search them much more easily.
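The character recognition step can be illustrated with a toy example. The sketch below uses template matching: each character in a tiny black-and-white image is compared with stored character shapes, and the closest shape wins. The 3x5 pixel templates, the four-letter alphabet and the function names are all hypothetical. Production OCR engines are far more sophisticated, so this shows only the basic idea.

```python
# Hypothetical 3x5 pixel templates: '#' is ink, '.' is blank background.
TEMPLATES = {
    "I": ["###", ".#.", ".#.", ".#.", "###"],
    "L": ["#..", "#..", "#..", "#..", "###"],
    "O": ["###", "#.#", "#.#", "#.#", "###"],
    "T": ["###", ".#.", ".#.", ".#.", ".#."],
}


def distance(glyph, template):
    """Count the pixels where the glyph and the template disagree."""
    return sum(
        g != t
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    )


def recognise_glyph(glyph):
    """Return the character whose template differs from the glyph in the fewest pixels."""
    return min(TEMPLATES, key=lambda ch: distance(glyph, TEMPLATES[ch]))


def recognise_line(rows, width=3, gap=1):
    """Cut an image of one text line into fixed-width glyphs and recognise each.

    Assumes every character is `width` pixels wide with `gap` blank columns between them.
    """
    step = width + gap
    glyphs = [
        [row[x:x + width] for row in rows]
        for x in range(0, len(rows[0]), step)
    ]
    return "".join(recognise_glyph(g) for g in glyphs)


if __name__ == "__main__":
    # Build a small "scanned" image of the word LOT from the templates.
    image = [".".join(TEMPLATES[ch][r] for ch in "LOT") for r in range(5)]
    print("\n".join(image))
    print(recognise_line(image))
```

Because the closest template wins, a glyph with a single smudged pixel is still matched to the right character, which is the property that lets OCR cope with imperfect scans.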
Sentiment Analysis
Sentiment analysis uses linguistics and natural language processing to systematically identify and study the emotional states expressed in a text. It has proved useful in building friendly chatbots (which can identify when a customer is particularly annoyed) and recommender systems that predict the rating a user would give to an item.
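One simple way to approximate sentiment analysis is a word list (lexicon) in which each word carries a positive or negative score. The sketch below sums those scores over a message and flips the sign of a word that directly follows a negation such as "not". The lexicon, the scores and the function names are hypothetical. Modern systems usually use trained language models instead, so this is an illustration of the principle and not a description of how any particular chatbot works.

```python
import re

# Hypothetical mini-lexicon: positive words score above zero, negative words below.
LEXICON = {
    "great": 1, "helpful": 1, "happy": 1, "thanks": 1,
    "slow": -1, "annoyed": -2, "terrible": -2, "useless": -2,
}
NEGATIONS = {"not", "never", "no"}


def sentiment_score(text):
    """Sum the lexicon scores, flipping the sign of a word that follows a negation."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0
    for i, token in enumerate(tokens):
        value = LEXICON.get(token, 0)
        if i > 0 and tokens[i - 1] in NEGATIONS:
            value = -value
        score += value
    return score


def label(text):
    """Turn the numeric score into a coarse label."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    for message in [
        "Thanks, that was really helpful",
        "I am annoyed, this is useless",
        "That was not helpful",
        "Please send the document",
    ]:
        print(f"{label(message):>8}  {message}")
```

A chatbot could use a label like this to escalate a conversation to a human when a customer's messages turn negative.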