Structured and Unstructured Data
One simple way to think about data is whether it is structured or not. Not all data is created equal: some of it is structured, but most of it is unstructured. The two kinds differ in how they are collected, processed, and analyzed. In this post, I want to explain the difference between structured and unstructured data.
Structured Data
Structured data mostly refers to highly organized information that is stored in a database and can be readily accessed with simple commands. It is most often categorized as quantitative data, and it’s the type of data most of us are used to working with. Think of data that fits neatly within fixed fields and columns in relational databases and spreadsheets. In structured data, we have defined labels. For example:
User | Weight | Height |
---|---|---|
Ana | 69 | 172 |
Bob | 88 | 188 |
Enrik | 200 | 170 |
But structured data does not need to be strictly numbers. Data can include numerical values (discrete, continuous, interval, ratio) and categorical values (nominal, ordinal). What matters for us is that any data we see here, whether numerical or categorical, is labeled. In other words, we know what each number or category means.
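To make the idea of labels concrete, here is a minimal Python sketch that stores the table above as labeled records. The helper names `average` and `users_where` are made up for this illustration, and the post does not state units, so none are assumed.

```python
from statistics import mean

# The table above as labeled records (units are not stated in the post).
records = [
    {"user": "Ana", "weight": 69, "height": 172},
    {"user": "Bob", "weight": 88, "height": 188},
    {"user": "Enrik", "weight": 200, "height": 170},
]


def average(rows, field):
    """Average of one labeled column."""
    return mean(row[field] for row in rows)


def users_where(rows, field, threshold):
    """Names of users whose value in `field` is above `threshold`."""
    return [row["user"] for row in rows if row[field] > threshold]


avg_weight = average(records, "weight")
tall_users = users_where(records, "height", 171)
print(avg_weight, tall_users)
```

Because every value carries a label, a program can ask for "weight" or "height" by name without guessing what each number means.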
Unstructured Data
Unstructured data is most often categorized as qualitative data, and it cannot easily be processed and analyzed using conventional tools and methods. Put simply, this kind of data is not labeled.
For example, suppose we collect a bunch of tweets from different users who mention Mr. Trump. The tweets are not clearly labeled. If we wanted to do some processing, we would not be able to do it easily. And certainly, if we were to create a systematic process (an algorithm, a program) to go through such data or observations, we would be in trouble, because that process would not be able to identify which piece of each tweet corresponds to which quantity.
Examples of unstructured data include text, video, audio, mobile activity, social media activity, satellite imagery, and surveillance imagery, and the list goes on. Unstructured data is difficult to deconstruct because it has no pre-defined model, meaning it cannot easily be organized into the rows and columns of a relational database. Of course, humans have no difficulty understanding a paragraph like this one, which is itself unstructured data. But if we want a systematic process for analyzing a large amount of data and creating insights from it, the more structured the data is, the better. When such data is not available, we look for ways to convert unstructured data to structured data, or to process unstructured data, such as text, directly.
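As a rough sketch of what converting unstructured text into structured data can look like, the Python snippet below pulls a few labeled fields out of tweet-like strings using regular expressions. The sample tweets, the function name `structure_tweet`, and the choice of fields are all invented for illustration; real text processing usually needs far more than this.

```python
import re

# Simple patterns for @mentions and #hashtags (an assumption for this sketch;
# real platforms have more detailed rules for valid names).
MENTION = re.compile(r"@(\w+)")
HASHTAG = re.compile(r"#(\w+)")


def structure_tweet(text):
    """Turn one free-text tweet into a record with labeled fields."""
    return {
        "mentions": MENTION.findall(text),
        "hashtags": HASHTAG.findall(text),
        "word_count": len(text.split()),
    }


# Hypothetical tweets, made up for this example.
tweets = [
    "Watching the debate with @alice tonight #politics",
    "No tags here, just thoughts.",
]

rows = [structure_tweet(t) for t in tweets]
for row in rows:
    print(row)
```

Each resulting row has the same labeled fields, so it could sit in a table like the one in the structured data section, even though much of the meaning of the original text is left behind.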
Challenges in Unstructured Data
The lack of structure makes compiling and organizing unstructured data a time- and energy-consuming task. It would be easy to derive insights from unstructured data if it could be instantly transformed into structured data, because structured data is friendly to machines: it is much easier for computers to parse.
Unstructured data, on the other hand, is often how humans communicate (“natural language”); people do not naturally interact with information in a strict database format. For example, email is unstructured data. An individual may arrange their inbox in a way that aligns with their organizational preferences, but that does not mean the data is structured. If it were truly structured, it would also be arranged by exact subject and content, with no deviation or variability. In practice, this would not work, because even focused emails tend to cover multiple subjects.
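The email case can be sketched with Python's standard `email` module. The message below is invented for illustration. Parsing it yields a few labeled header fields, but the body, where the actual content lives, comes back as one block of free text that happens to cover two subjects.

```python
from email import message_from_string

# A hypothetical message, made up for this example.
raw = (
    "From: ana@example.com\n"
    "To: bob@example.com\n"
    "Subject: Budget and offsite\n"
    "\n"
    "Hi Bob, the budget is approved. Also, can we move the offsite to May?\n"
)

msg = message_from_string(raw)

# The headers are labeled, so they are easy to read by name.
headers = {"from": msg["From"], "to": msg["To"], "subject": msg["Subject"]}

# The body is free text: nothing marks where one topic ends and the next begins.
body = msg.get_payload()

print(headers)
print(body)
```

Sorting a mailbox by the header fields is straightforward, but deciding what the body is actually about still requires reading or analyzing the text itself.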
And here is where data science is useful. Because the pool of information is so large, current data mining techniques often miss a substantial amount of available content, much of which could be game-changing if efficiently analyzed.