
Chapter 4
Data in the IoT Ecosystem
In this chapter, we give an overview of data in the IoT ecosystem. We start with some new concepts and definitions, and then introduce data management, IoT analytics, big data, and the data analytics lifecycle. We then briefly review some data analytics techniques.

4.1 IoT Data
The main objective of IoT systems is to gather, process, and analyze data to extract information and obtain knowledge. Here, we distinguish these terminologies as follows:
• Data: Raw, unorganized facts, which are worthless by themselves.
• Information: Potentially valuable concepts based on data.
• Knowledge: What we understand based upon information.
• Wisdom: The effective use of knowledge in decision making.
The main goal of IoT systems is to obtain that wisdom. This video gives an overview of data in the IoT ecosystem.
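As a toy illustration of this hierarchy, the following Python sketch (with hypothetical readings and a hypothetical threshold, not taken from the text) walks raw temperature samples through the four levels:

# A toy illustration of the data -> information -> knowledge -> wisdom
# hierarchy. Readings and the threshold are hypothetical.
from statistics import mean

# Data: raw, unorganized facts (sensor readings in Celsius)
data = [21.3, 21.5, 29.8, 30.2, 30.5]

# Information: a potentially valuable concept derived from the data
avg_temp = mean(data)

# Knowledge: what we understand based on the information
overheating = avg_temp > 28.0

# Wisdom: using that knowledge to make a decision
action = "turn on cooling" if overheating else "no action needed"
print(f"average={avg_temp:.1f}C, overheating={overheating}, action={action}")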
4.1.1 Characteristics of IoT Data
IoT data has the following characteristics:
1. Big Data: Huge amounts of data are generated, capturing detailed aspects of the processes in which devices are involved.
2. Heterogeneous Data: The data is produced by a huge variety of devices and is itself highly heterogeneous, differing in sampling rate, quality of captured values, etc.
3. Real-World Data: The overwhelming majority of M2M data relates to real-world processes and depends on the environment the devices interact with.
4. Real-Time Data: M2M data is generated in real time and can overwhelmingly be communicated in a very timely manner as well. The latter is of pivotal importance, since the business value of the data often depends on real-time processing of the information it conveys.
5. Temporal Data: The overwhelming majority of M2M data is temporal in nature, measuring the environment over time.
6. Spatial Data: Increasingly, the data generated by M2M interactions is not only captured by mobile devices, but also coupled to interactions in specific locations, and its assessment may vary dynamically depending on the location.
7. Polymorphic Data: The data acquired and used by M2M processes may be complex and take various forms, which can also take on different meanings depending on the semantics applied and the process in which they participate.

8. Proprietary Data: Up to now, due to monolithic application development, a significant amount of M2M data has been captured and stored in proprietary formats. Increasingly, however, due to interactions with heterogeneous devices and stakeholders, open approaches for data storage and exchange are being used.
9. Security and Privacy Aspects: Due to the detailed capturing of interactions by M2M, analysis of the obtained data carries a high risk of leaking private information and usage patterns, as well as compromising security.
4.2 Data Management
In the era of M2M, where billions of devices interact and generate data at exponential growth rates, data management is of critical importance as it sets the basis upon which any other processes can rely and operate.
From the moment data is sensed (e.g. by a wireless sensor node) to the moment it reaches the backend system, it is processed many times (and often redundantly), either to adjust its representation so it can be easily integrated by diverse applications, or to compute on it in order to extract and associate it with the respective business intelligence (e.g. the business process affected, etc.). Fig. 4.1 shows the flow of IoT data. In the following, we briefly explain each part of this flow.
Figure 4.1: The flow of IoT data.
4.2.1 Data Generation
Data generation is the first stage, within which data is generated actively or passively by the device, the system, or as a result of its interactions. The sampling rate of data generation depends on the device and its capabilities, as well as potentially on the application needs. Usually, default behaviours for data generation exist, which are typically further configurable to strike a good balance between the benefit gained and the costs involved, e.g. frequency of data collection vs. energy used in the case of WSNs. Not all data acquired is actually communicated: some of it may be assessed locally and subsequently disregarded, with only the result of the assessment being communicated.
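As an illustration of such local assessment, the following hypothetical Python sketch samples a sensor frequently but only transmits readings that change enough to matter, trading reporting fidelity against energy cost:

import random
import time

# Hypothetical on-device filter: sample often, transmit rarely.
# Only readings that deviate enough from the last reported value are
# communicated, saving radio energy (a dominant cost in WSNs).
REPORT_DELTA = 0.5  # configurable trade-off: fidelity vs. energy

def read_sensor():
    # Stand-in for real hardware access.
    return 20.0 + random.uniform(-1.0, 1.0)

last_reported = None
for _ in range(10):
    value = read_sensor()
    if last_reported is None or abs(value - last_reported) >= REPORT_DELTA:
        print(f"transmit {value:.2f}")  # stand-in for a radio send
        last_reported = value
    # else: assess locally and disregard, as described above
    time.sleep(0.01)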
4.2.2 Data Acquisition
Data acquisition deals with the collection of data (actively or passively) from the device, system, or as a result of its interactions. The data acquisition systems usually communicate with distributed devices over wired or wireless links to acquire the needed data, and need to respect security, protocol, and application requirements.
The nature of acquisition varies; e.g. it could be continuous monitoring, interval polling, event-based, etc. The frequency of data acquisition overwhelmingly depends on, or is customized by, the application requirements (or their common denominator). The data acquired at this stage (for non-closed local control loops) may also differ from the data actually generated; only a fraction of the generated data may be communicated. Data aggregation, and even on-device computation, may result in the communication of key performance indicators of interest to the application, which are calculated based on a device's own intelligence and capabilities.
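A minimal sketch of interval-poll acquisition is shown below; the endpoint URL, payload format, and polling interval are illustrative assumptions, not prescribed by any particular system:

import json
import time
import urllib.request

# Hypothetical interval-poll acquisition loop.
DEVICE_URL = "http://device.local/api/reading"  # hypothetical endpoint
POLL_INTERVAL_S = 5

def acquire_once():
    # Acquire one reading over the (wired or wireless) link.
    with urllib.request.urlopen(DEVICE_URL, timeout=2) as resp:
        return json.load(resp)

for _ in range(3):  # bounded here; a real collector runs continuously
    try:
        reading = acquire_once()
        print("acquired:", reading)
    except OSError as err:
        # Respect application requirements: log and retry, do not crash.
        print("acquisition failed:", err)
    time.sleep(POLL_INTERVAL_S)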
4.2.3 Data Validation
Data acquired must be checked for correctness and meaningfulness within the specific operating context. The latter is usually done based on rules, semantic annotations, or other logic.
Data validation is a must in the era of M2M, where the acquired data may not conform to expectations: data may be intentionally or unintentionally corrupted during transmission, altered, or simply not make sense in the business context. As real-world processes depend on valid data to draw business-relevant decisions, this is a key stage, which sometimes does not receive as much attention as it should. In addition, semantics may play an increasing role here: the same data may have different meanings in different operating contexts, and semantics can help when attempting to validate it. Another part of validation may deal with fallback actions, such as requesting the data again if checks fail, or attempting to repair partially failed data.
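The following Python sketch illustrates rule-based validation; the field names, plausible ranges, and timestamp rule are illustrative assumptions:

from datetime import datetime, timezone, timedelta

# Hypothetical rule-based validation for a sensor reading.
RULES = {
    "temperature_c": lambda v: -40.0 <= v <= 85.0,  # sensor's rated range
    "battery_pct":   lambda v: 0.0 <= v <= 100.0,
}

def validate(reading: dict) -> list[str]:
    """Return a list of rule violations; empty means the reading is valid."""
    errors = []
    for field, check in RULES.items():
        if field not in reading:
            errors.append(f"missing field: {field}")
        elif not check(reading[field]):
            errors.append(f"out of range: {field}={reading[field]}")
    # Timestamps far in the future make no sense in this business context.
    ts = datetime.fromisoformat(reading["timestamp"])
    if ts > datetime.now(timezone.utc) + timedelta(minutes=5):
        errors.append("timestamp in the future")
    return errors

reading = {"temperature_c": 150.0, "battery_pct": 87.0,
           "timestamp": "2024-01-01T12:00:00+00:00"}
print(validate(reading))  # a failed check could trigger a re-request (fallback)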
4.2.4 Data Storage
The data generated by M2M interactions is what is commonly referred to as Big Data. Machines generate an incredible amount of information that is captured and needs to be stored for further processing. As this is proving challenging due to the size of the information, a balance between its business usage and storage needs to be considered; that is, only the fraction of the data relevant to a business need may be stored for future reference. This means, for instance, that in a specific scenario (usually for on-the-fly data that was used to make a decision), once the decision is made, the processed result can be stored, but not necessarily the original data.
However, one has to carefully consider the value of such data to the business, not only in current processes, but also in other directions that the company may follow in the future, as different assessments of the same data may reveal other, hidden competitive advantages. Due to the massive amounts of M2M data, as well as their envisioned processing (e.g. searching), specialized technologies such as massively parallel processing DBs, distributed file systems, cloud computing platforms, etc. are needed.
4.2.5 Data Processing
Data processing enables working with data that is either at rest (already stored) or in motion (e.g. stream data). The scope of this processing is to operate on the data at a low level and enhance it for future needs. Typical examples include data adjustment, during which it might be necessary to normalize data, introduce an estimate for a missing value, re-order incoming data by adjusting timestamps, etc.
Similarly, aggregation or general calculation functions may operate on two or more data streams, with mathematical functions applied to their composition. Missing or invalid data that is needed for a specific time slot may be forecasted and used until the actual data arrives in a future interaction. This stage mostly involves generic operations applied with the aim of enhancing the data, and it takes advantage of low-level functions (such as DB stored procedures) that can operate at massive scale with very low overhead, network traffic, and other limitations.
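The sketch below illustrates these low-level operations on a small hypothetical batch of readings: re-ordering by timestamp, estimating a missing value, and normalizing:

from statistics import mean

# Hypothetical batch; the field layout is an illustrative assumption.
batch = [
    {"t": 3, "value": 22.0},
    {"t": 1, "value": 20.0},
    {"t": 2, "value": None},   # missing: estimate until real data arrives
    {"t": 4, "value": 26.0},
]

# Re-order incoming data by (adjusted) timestamp.
batch.sort(key=lambda r: r["t"])

# Introduce an estimate for the missing value (here: mean of known values).
known = [r["value"] for r in batch if r["value"] is not None]
for r in batch:
    if r["value"] is None:
        r["value"] = mean(known)

# Normalize values into [0, 1].
lo, hi = min(r["value"] for r in batch), max(r["value"] for r in batch)
for r in batch:
    r["norm"] = (r["value"] - lo) / (hi - lo)

print(batch)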
4.2.6 Data Remanence
M2M data may reveal critical business aspects, and hence its lifecycle management should include not only the acquisition and usage, but also the end-of-life of data. Even if the data is erased or removed, residues may still remain in electronic media, a phenomenon referred to as data remanence, and may easily be recovered by third parties.
Several techniques have been developed to deal with this, such as overwriting, degaussing, encryption, and physical destruction. For M2M, the points of interest are not only the DBs where the M2M data is collected, but also the points of action that generate the data, and the individual nodes in between, which may cache it.
At the current technology pace, those buffers (e.g. on-device) are expected to be less at risk, since their limited size means that after a specific time has elapsed, new data will occupy that space; hence, the window of opportunity is rather small. In addition, for large-scale infrastructures the cost of potentially acquiring deleted data may be large; hence, hubs or collection end-points such as DBs, where that cost is low, may be more at risk.
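A minimal sketch of the overwriting technique mentioned above is given below; the helper is hypothetical, and note that on SSDs and journaling or copy-on-write file systems, overwriting a file does not guarantee the physical blocks are erased:

import os

# Hypothetical helper illustrating the "overwriting" technique.
def overwrite_and_delete(path: str, passes: int = 3) -> None:
    size = os.path.getsize(path)
    with open(path, "r+b") as f:
        for _ in range(passes):
            f.seek(0)
            f.write(os.urandom(size))  # replace contents with random bytes
            f.flush()
            os.fsync(f.fileno())       # push the overwrite to the device
    os.remove(path)

# overwrite_and_delete("/tmp/cached_readings.bin")  # hypothetical path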
4.2.7 Data Analytics
Data available in the repositories can be subjected to analysis with the aim of obtaining the information it encapsulates and using it to support decision-making processes. The analysis of data at this stage heavily depends on the domain and the context of the data.
For instance, business intelligence tools process the data with a focus on aggregation and key performance indicator assessment.
Data mining focuses on discovering knowledge, usually in conjunction with predictive goals. Statistics can also be used on the data to assess it quantitatively (descriptive statistics), find its main characteristics (exploratory data analysis), confirm a specific hypothesis (confirmatory data analysis), discover knowledge (data mining), and support machine learning, etc. This stage is the basis for any sophisticated application that takes advantage of the information hidden directly or indirectly in the data, and can be used, for example, for business insights.
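As a small taste of the descriptive end of this spectrum, the following sketch computes a few quantitative summaries over a hypothetical stored series of readings:

import statistics as st

# Descriptive statistics on a (hypothetical) series of hourly readings.
readings = [20.1, 20.4, 21.0, 22.5, 23.1, 22.8, 21.9, 21.2]

print("mean:  ", st.mean(readings))    # central tendency
print("median:", st.median(readings))
print("stdev: ", st.stdev(readings))   # spread
print("range: ", max(readings) - min(readings))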
4.3 Big Data
IoT data fulfils all the characteristics of Big Data, which is usually described by the four “Vs”:
1. Volume: To be able to create good analytical models, it is no longer enough to analyse the data once and then discard it. Creating a valid model often requires a longer period of historic data. This means that the amount of historic data for M2M devices is expected to grow rapidly.
2. Velocity: Even though M2M devices usually report quite seldom, the sheer number of devices means that the systems will have to handle a huge number of transactions per second. Also, the value of M2M data is often strongly related to how fresh it is, in order to provide the best actionable intelligence, which puts requirements on the analytical platform.
3. Variation: Given the multitude of device types used in M2M, it is apparent that the variation will be very high. This is further complicated by the use of different data formats, as well as different configurations for devices of the same type (e.g. one device measures temperature in Celsius every minute, while another measures it in Fahrenheit every hour). The upside is that the data is expected to be semantically well-defined, which allows for simple transformation rules.
4. Veracity: It is imperative that we can trust the data that is analysed. There are many pitfalls along the way, such as erroneous timestamps, non-adherence to standards, proprietary formats with missing semantics, wrongly calibrated sensors, and missing data. This requires rules that can handle these cases, as well as fault-tolerant algorithms that can, for example, detect outliers (anomalies); a simple sketch of such a check follows this list.
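The sketch below implements a simple, fault-tolerant outlier check using the median absolute deviation (MAD); the 3.5 threshold is a common heuristic, and the readings are hypothetical:

from statistics import median

def outliers(values, threshold=3.5):
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []
    # 0.6745 scales MAD to be comparable with a standard deviation.
    return [v for v in values if abs(0.6745 * (v - med) / mad) > threshold]

readings = [20.1, 20.3, 20.2, 85.0, 20.4, 20.2]  # 85.0: miscalibrated sensor?
print(outliers(readings))  # -> [85.0]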
The following video gives some insight into IoT data.

4.4 Data Analytics
In 2002, Donald Rumsfeld was asked a question regarding Iraq's alleged weapons of mass destruction. In response, he uttered a now infamous description of intelligence:

There are known knowns; there are things we know we know. … There are known unknowns; that is to say, we know there are some things we do not know. But there are also unknown unknowns, the ones we don't know we don't know.

Data analytics enables us to understand unknown unknowns, which unleashes many potentials that are not even imaginable to us now.
4.4.1 Types of IoT Data
IoT data can come in different forms, ranging from structured to completely unstructured formats. Figure 4.2 shows the different types of IoT data:
• Structured: Data containing a defined data type, format, and structure. Example: transaction data.
• Semi-structured: Textual data files with a discernible pattern, enabling parsing. Example: XML data files that are self-describing and defined by an XML schema.
• Quasi-structured: Textual data with erratic formats, which can be formatted with effort, tools, and time. Example: web clickstream data.
• Unstructured: Data that has no inherent structure and is usually stored as different types of files. Example: text documents, PDFs, images.

Figure 4.2: Different types of IoT data.
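The self-describing nature of semi-structured data makes parsing straightforward. The following sketch parses a hypothetical XML sensor record; the element and attribute names are illustrative assumptions:

import xml.etree.ElementTree as ET

# A hypothetical self-describing, semi-structured XML record.
record = """
<reading deviceId="sensor-42">
  <timestamp>2024-01-01T12:00:00Z</timestamp>
  <temperature unit="C">21.7</temperature>
</reading>
"""

root = ET.fromstring(record)
print(root.attrib["deviceId"])                   # sensor-42
print(root.find("temperature").text,             # 21.7
      root.find("temperature").attrib["unit"])   # C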

4.4.2 Data Analytics Lifecycle
Fig. 4.3 shows the data analytics lifecycle. In the following, we briefly explain each phase of this cycle.
Figure 4.3: The data analytics lifecycle: Discovery, Data Preparation, Model Planning, Model Building, Communicate Results, Operationalize.
Phase 1-Discovery
In Phase 1, the team learns the business domain, including relevant history such as whether the organization or business unit has attempted similar projects in the past from which they can learn. The team assesses the resources available to support the project in terms of people, technology, time, and data. Important activities in this phase include framing the business problem as an analytics challenge that can be addressed in subsequent phases and formulating initial hypotheses (IHs) to test and begin learning the data.
Phase 2-Data preparation
Phase 2 requires the presence of an analytic sandbox, in which the team can work with data and perform analytics for the duration of the project. The team needs to execute extract, load, and transform (ELT) or extract, transform, and load (ETL) processes to get data into the sandbox; the two approaches are sometimes jointly abbreviated as ETLT. Data should be transformed in the ETLT process so the team can work with it and analyze it. In this phase, the team also needs to familiarize itself with the data thoroughly and take steps to condition the data.
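A minimal ETL sketch into an analytic sandbox is shown below; SQLite stands in for the sandbox, and the CSV layout and unit conversion are illustrative assumptions:

import csv
import io
import sqlite3

# Hypothetical raw device export.
raw_csv = "device,temp_f\nsensor-1,68.0\nsensor-2,71.6\n"

# Extract
rows = list(csv.DictReader(io.StringIO(raw_csv)))

# Transform: convert Fahrenheit to Celsius so the team can work with
# consistent units during analysis.
for r in rows:
    r["temp_c"] = (float(r["temp_f"]) - 32.0) * 5.0 / 9.0

# Load into the sandbox.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE readings (device TEXT, temp_c REAL)")
db.executemany("INSERT INTO readings VALUES (?, ?)",
               [(r["device"], r["temp_c"]) for r in rows])
print(db.execute("SELECT * FROM readings").fetchall())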
Phase 3-Model planning
Phase 3 is model planning, where the team determines the methods, techniques, and workflow it intends to follow for the subsequent model building phase. The team explores the data to learn about the relationships between variables and subsequently selects key variables and the most suitable models.
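As a small sketch of this exploration, the following computes pairwise correlations between hypothetical candidate variables to help select key ones (statistics.correlation requires Python 3.10+):

from statistics import correlation  # Python 3.10+

# Hypothetical candidate variables for a cooling-load model.
temperature = [20.1, 22.3, 24.8, 27.0, 29.5, 31.2]
energy_use  = [5.0, 5.4, 6.1, 7.2, 8.9, 10.3]
humidity    = [55, 52, 58, 49, 54, 51]

print("temp vs energy:    ", correlation(temperature, energy_use))
print("humidity vs energy:", correlation(humidity, energy_use))
# A strong correlation suggests temperature is a key variable to keep.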

Phase 4-Model building
In Phase 4, the team develops datasets for testing, training, and production purposes. In addition, in this phase the team builds and executes models based on the work done in the model planning phase. The team also considers whether its existing tools will suffice for running the models, or whether it will need a more robust environment for executing models and workflows (for example, fast hardware and parallel processing, if applicable).
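A minimal sketch of this phase on a hypothetical synthetic dataset: split the data into training and test portions, fit a trivial one-variable linear model on the training split only, and evaluate on held-out data (statistics.linear_regression requires Python 3.10+):

import random
from statistics import linear_regression  # Python 3.10+

# Hypothetical synthetic data: y is roughly 2x plus noise.
data = [(x, 2.0 * x + random.uniform(-0.5, 0.5)) for x in range(100)]
random.shuffle(data)
split = int(0.8 * len(data))
train, test = data[:split], data[split:]

# Fit only on the training split.
slope, intercept = linear_regression([x for x, _ in train],
                                     [y for _, y in train])

# Evaluate on held-out data the model never saw during fitting.
mse = sum((slope * x + intercept - y) ** 2 for x, y in test) / len(test)
print(f"slope={slope:.2f}, intercept={intercept:.2f}, test MSE={mse:.3f}")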
Phase 5-Communicate results
In Phase 5, the team, in collaboration with major stakeholders, determines whether the results of the project are a success or a failure based on the criteria developed in Phase 1. The team should identify key findings, quantify the business value, and develop a narrative to summarize and convey the findings to stakeholders.
Phase 6-Operationalize
In Phase 6, the team delivers final reports, briefings, code, and technical documents. In addition, the team may run a pilot project to implement the models in a production environment.
This video gives a brief overview of data analytics for banks. Here, we consider a data analytics example to better understand how data analytics can extract useful information from raw data.
4.5 Further Reading
1. Chapter 5.3 of From Machine-to-Machine to the Internet of Things.
2. Read this example on Practical Data Analysis.
3. Watch this interesting video on Big Data by Tim Smith.
4. Read this interesting article on Big Data: https://upside.tdwi.org/articles/2016/05/17/understand-knowns-and-unknowns-in-data.aspx
