What is a Distributed System

Networks of computers are everywhere. The Internet is one, as are the many networks of which it is composed. Mobile phone networks, corporate networks, factory networks, campus networks, home networks, in-car networks – all of these, both separately and in combination, share the essential characteristics that make them relevant subjects for study under the heading distributed systems. Distributed computing deals with all forms of computing, information access, and information exchange across multiple processing platforms connected by computer networks.

Overview

Over the past two decades, advancements in microelectronic technology have resulted in the availability of fast, inexpensive processors, and advancements in communication technology have resulted in the availability of cost-effective and highly efficient computer networks. The net result of the advancements in these two technologies is that the price-performance ratio has now changed to favor the use of interconnected multiple processors in place of a single, high-speed processor.

Distributed systems form a rapidly changing field of computer science. A distributed computer system consists of multiple software components that are on multiple computers but run as a single system. The computers that are in a distributed system can be physically close together and connected by a local network, or they can be geographically distant and connected by a wide area network. A distributed system can consist of any number of possible configurations, such as mainframes, personal computers, workstations, minicomputers, and so on. The goal of distributed computing is to make such a network work as a single computer. A distributed system is one in which components located at networked computers communicate and coordinate their actions only by passing messages. This definition leads to the following especially significant characteristics of distributed systems: concurrency of components, and lack of a global clock.
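
Since the definition hinges on components that coordinate only by passing messages, a minimal sketch may help make it concrete. The example below is illustrative rather than drawn from the references: two components on one machine exchange a request and a reply over a plain TCP socket, and the host, port, and message contents are arbitrary.

```python
# Minimal sketch: two components that coordinate only by exchanging messages
# over a network socket. Host, port, and message contents are illustrative.
import socket
import threading
import time

HOST, PORT = "127.0.0.1", 50007   # arbitrary local address for the demo

def server():
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.bind((HOST, PORT))
        srv.listen()
        conn, _ = srv.accept()
        with conn:
            request = conn.recv(1024).decode()        # a message arrives
            conn.sendall(f"ack:{request}".encode())   # the reply is also just a message

def client():
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as cli:
        cli.connect((HOST, PORT))
        cli.sendall(b"update-record-42")
        print("reply from remote component:", cli.recv(1024).decode())

t = threading.Thread(target=server)
t.start()
time.sleep(0.3)   # give the server a moment to start listening
client()
t.join()
```

There is no shared memory and no shared clock between the two components; everything each one knows about the other comes from the messages themselves, which is exactly the situation the definition above describes.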

Figure 1: Infrastructure for Distributed System

Distributed System Features

A distributed system can be characterized as a collection of mostly autonomous processors communicating over a communication network and having the following features:

No Common Physical Clock: This is an important assumption because it introduces the element of “distribution” in the system and gives rise to the inherent asynchrony amongst the processors.

No Shared Memory: This is a key feature that requires message-passing for communication. This feature implies the absence of the common physical clock.

Geographical Separation: The more widely separated the processors are geographically, the more representative the system is of a distributed system. However, it is not necessary for the processors to be on a wide-area network (WAN). Recently, the network/cluster of workstations (NOW/COW) configuration connecting processors on a LAN has also been increasingly regarded as a small distributed system. The NOW configuration is becoming popular because of the low-cost, high-speed, off-the-shelf processors now available. The Google search engine is based on the NOW architecture.

Autonomy and Heterogeneity: The processors are “loosely coupled” in that they have different speeds and each can be running a different operating system. They are usually not part of a dedicated system but cooperate with one another by offering services or solving a problem jointly.

Challenges for a Distributed System

Designing a distributed system is neither easy nor straightforward. A number of challenges must be overcome in order to build an ideal system. The major challenges in distributed systems are listed below:

Figure 2: Overview of Challenges

Heterogeneity:  The Internet enables users to access services and run applications over a heterogeneous collection of computers and networks. Heterogeneity (that is, variety and difference) applies to all of the following:

  • Hardware Devices: computers, tablets, mobile phones, embedded devices, etc.
  • Operating Systems: MS Windows, Linux, Mac, Unix, etc.
  • Networks: local networks, the Internet, wireless networks, satellite links, etc.
  • Programming Languages: Java, C/C++, Python, PHP, etc.
  • Roles: software developers, designers, system managers, etc.

Transparency: Transparency is defined as the concealment from the user and the application programmer of the separation of components in a distributed system, so that the system is perceived as a whole rather than as a collection of independent components. In other words, distributed systems designers must hide the complexity of the system as much as they can. Some forms of transparency in distributed systems are:

  • Access: Hide differences in data representation and how a resource is accessed
  • Location: Hide where a resource is located
  • Migration: Hide that a resource may move to another location
  • Relocation: Hide that a resource may be moved to another location while in use
  • Replication: Hide that a resource may be copied in several places
  • Concurrency: Hide that a resource may be shared by several competitive users
  • Failure: Hide the failure and recovery of a resource
  • Persistence: Hide whether a (software) resource is in memory or on disk

Openness: The openness of a computer system is the characteristic that determines whether the system can be extended and reimplemented in various ways. The openness of distributed systems is determined primarily by the degree to which new resource-sharing services can be added and made available for use by a variety of client programs. If the well-defined interfaces for a system are published, it is easier for developers to add new features or replace sub-systems in the future. For example, Twitter and Facebook provide APIs that allow developers to build their own software that interacts with these platforms.

Concurrency: Both services and applications provide resources that can be shared by clients in a distributed system. There is therefore a possibility that several clients will attempt to access a shared resource at the same time. For example, a data structure that records bids for an auction may be accessed very frequently when it gets close to the deadline time. For an object to be safe in a concurrent environment, its operations must be synchronized in such a way that its data remains consistent. This can be achieved by standard techniques such as semaphores, which are used in most operating systems.
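
As a toy illustration of that synchronization requirement (a sketch only; the auction record and bid amounts are invented), a lock, the simplest form of semaphore, can serialize concurrent updates to shared bid data:

```python
# Minimal sketch: a lock (a binary semaphore) keeps a shared auction record
# consistent when several clients bid at the same time. All values are invented.
import threading

highest_bid = 0
lock = threading.Lock()

def place_bid(amount):
    global highest_bid
    with lock:                     # only one bidder updates the record at a time
        if amount > highest_bid:
            highest_bid = amount

threads = [threading.Thread(target=place_bid, args=(b,)) for b in (100, 250, 180)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("winning bid:", highest_bid)  # 250, regardless of how the threads interleave
```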

Security: Many of the information resources that are made available and maintained in distributed systems have a high intrinsic value to their users. Their security is therefore of considerable importance. Security for information resources has three components:

  • Confidentiality (protection against disclosure to unauthorized individuals)
  • Integrity (protection against alteration or corruption)
  • Availability for the authorized (protection against interference with the means to access the resources).

Scalability: Distributed systems must be scalable as the number of users increases. A system is said to be scalable if it can handle the addition of users and resources without suffering a noticeable loss of performance or increase in administrative complexity.

Scalability has three dimensions:

Size: the number of users and resources to be processed. The associated problem is overloading.

Geography: the distance between users and resources. The associated problem is communication reliability.

Administration: as a distributed system grows in size, more and more of its components must be administered. The associated problem is an administrative mess.

Failure Handling: Computer systems sometimes fail. When faults occur in hardware or software, programs may produce incorrect results or may stop before they have completed the intended computation. The handling of failures is particularly difficult.

References

[1] Kshemkalyani, Ajay D., and Mukesh Singhal, “Distributed computing: principles, algorithms, and systems”, Cambridge University Press, 2011.

[2] George Coulouris and Jean Dollimore, “Distributed Systems: Concepts and Design”, Pearson education, 2005.

[3] S.G. Bhagwath and Dr. Mallikarjun Math, “Distributed Systems and Recent Innovations: Challenges Benefits and Security Issues in Distributed Systems”, Bonfring International Journal of Software Engineering and Soft Computing, Vol. 6, Special Issue, October 2016

[4] Nadiminti, Krishna and Rajkumar Buyya, “Distributed systems and recent innovations: Challenges and benefits”, InfoNet Magazine 16.3 (2006): 1-5.

 

What is a Distributed Database

A distributed database is a database in which portions of the database are stored in multiple physical locations and processing is distributed among multiple database nodes. Distributed databases can be homogenous or heterogeneous. In a homogenous distributed database system, all the physical locations have the same underlying hardware and run the same operating systems and database applications. In a heterogeneous distributed database, the hardware, operating systems, or database applications may be different at each of the locations.

Overview

A distributed database is a database distributed between several sites. The reasons for distributing the data may include the inherently distributed nature of the data itself or performance considerations. In a distributed database, the data at each site is not necessarily an independent entity but can rather be related to the data stored at the other sites. A distributed database (DDB) is a collection of multiple, logically interrelated databases distributed over a computer network. A distributed database management system (DDBMS) is the software that manages the DDB and provides an access mechanism that makes this distribution transparent to the user. A distributed database system (DDBS) is the integration of a DDB and a DDBMS. This integration is achieved through the merging of database and networking technologies [2].

A distributed database can reside on network servers on the Internet, on corporate intranets or extranets, or on other company networks. The replication and distribution of databases improves database performance at end-user worksites. To ensure that distributed databases remain up to date, there are two processes: replication and duplication. Replication involves using specialized software that looks for changes in the distributed database. Once the changes have been identified, the replication process makes all the databases look the same [3].
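
The replication idea can be sketched in a few lines. The snippet below only illustrates the principle, not how a real DDBMS works internally (real systems use logs, triggers, or timestamps rather than whole-copy comparison), and the record keys are invented:

```python
# Toy sketch of replication: detect changes at the primary copy and apply them
# to the other sites so that all copies end up looking the same.
primary = {"cust/1": "Alice", "cust/2": "Bob"}
replicas = [{"cust/1": "Alice"}, {}]             # copies held at two other sites

def replicate(primary, replicas):
    for copy in replicas:
        for key, value in primary.items():
            if copy.get(key) != value:           # change detected at the primary
                copy[key] = value                # propagate it to the remote copy
        for key in list(copy):
            if key not in primary:               # remove rows deleted at the primary
                del copy[key]

replicate(primary, replicas)
assert all(copy == primary for copy in replicas)  # every site now looks the same
```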

Database Management System

Outsourcing the DBMS, an integral and indispensable component of any data-management stack, is attractive because it reduces spending on energy and hardware through Database as a Service (DBaaS). Survey work in this area examines workload management for multi-tenant environments, elastic scalability, and adjustable security schemes that can operate over encrypted data. It also studies efficient and scalable ACID transactions in the cloud, achieved by decomposing the functions of a database storage engine into transactional components and data components.

Figure 1: Distributed Database

It is important to understand the difference between distributed and decentralized databases. A decentralized database is likewise stored on computers at multiple locations; however, these computers are not connected by network and database software, so the data does not appear to be in one logical database. As a result, users at the different sites cannot access one another's data. A decentralized database is best described as a collection of independent databases, rather than the geographical distribution of a single database. The use of distributed databases has grown in several business situations:

  • Distribution and autonomy of business units: Divisions, departments, and facilities in modern organizations are frequently geographically dispersed, often across different countries. Each unit may create its own information systems, and these units want local data over which they can have control.
  • Data communication costs and reliability: The cost to ship large quantities of data across a network, or to handle a large volume of transactions from remote sources, can still be high, even though data communication costs have fallen substantially in recent years. In many cases it is more economical to locate data and applications close to where they are needed. Moreover, dependence on data communications always involves an element of risk, so keeping local copies or fragments of data can be a reliable way to support the need for rapid access to data across the organization.
  • Database recovery: Replicating data on separate computers is one strategy for ensuring that a damaged database can be recovered quickly and that users can access data while the primary site is being restored. Replicating data across multiple computer sites is one natural form of a distributed database.
  • Satisfying both transactional and analytical processing: The requirements for database management differ between OLTP and OLAP applications, yet the two kinds of application often share much of the same data. Distributed database technology can be helpful in synchronizing data across OLTP and OLAP platforms.

References

[1] Joshi, Himanshu, and G. R. Bamnote, “Distributed database: A survey”, International Journal of Computer Science and Applications 6.2 (2013).

[2] Gupta, Swati, and Kuntal Saroha, “Fundamental research in distributed database”, IJCSMS 11.2 (2011).

[3] Stanchev, Lubomir, “Survey Paper for CS748T Distributed Database Management Lecturer” (2001).

What is Deep Learning: A Basic Concept

Machine-learning technology powers many aspects of modern society: from web searches to content filtering on social networks to recommendations on e-commerce websites, and it is increasingly present in consumer products such as cameras and smartphones. Machine-learning systems are used to identify objects in images, transcribe speech into text, match news items, posts, or products with users’ interests, and select relevant results of a search. Increasingly, these applications make use of a class of techniques called deep learning. Deep learning is a machine learning technique that teaches computers to do what comes naturally to humans: learn by example.

Overview

Deep Learning is a subfield of machine learning concerned with algorithms inspired by the structure and function of the brain called artificial neural networks. In deep learning, a computer model learns to perform classification tasks directly from images, text, or sound. Deep learning models can achieve state-of-the-art accuracy, sometimes exceeding human-level performance. Models are trained by using a large set of labeled data and neural network architectures that contain many layers. Deep learning is a subset of machine learning. Usually, when people use the term deep learning, they are referring to deep artificial neural networks, and somewhat less frequently to deep reinforcement learning.

Deep is a technical term. It refers to the number of layers in a neural network. A shallow network has one so-called hidden layer, and a deep network has more than one. Multiple hidden layers allow deep neural networks to learn features of the data in a so-called feature hierarchy, because simple features (e.g. two pixels) recombine from one layer to the next, to form more complex features (e.g. a line). Nets with many layers pass input data (features) through more mathematical operations than nets with few layers and are therefore more computationally intensive to train. Computational intensity is one of the hallmarks of deep learning, and it is one reason why GPUs are in demand to train deep-learning models.

Deep learning is a key technology behind driverless cars, enabling them to recognize a stop sign or to distinguish a pedestrian from a lamppost. It is the key to voice control in consumer devices like phones, tablets, TVs, and hands-free speakers. Deep learning is getting lots of attention lately and for good reason. It’s achieving results that were not possible before.

Examples of Deep Learning at Work

Deep learning applications are used in industries from automated driving to medical devices.

Automated Driving: Automotive researchers are using deep learning to automatically detect objects such as stop signs and traffic lights. In addition, deep learning is used to detect pedestrians, which helps decrease accidents.

Aerospace and Defense: Deep learning is used to identify objects from satellites that locate areas of interest, and identify safe or unsafe zones for troops.

Medical Research: Cancer researchers are using deep learning to automatically detect cancer cells. Teams at UCLA built an advanced microscope that yields a high-dimensional data set used to train a deep learning application to accurately identify cancer cells.

Industrial Automation: Deep learning is helping to improve worker safety around heavy machinery by automatically detecting when people or objects are within an unsafe distance of machines.

Electronics: Deep learning is being used in automated hearing and speech translation. For example, home assistance devices that respond to your voice and know your preferences are powered by deep learning applications.

How Deep Learning Works

Most deep learning methods use neural network architectures, which is why deep learning models are often referred to as deep neural networks.

The term “deep” usually refers to the number of hidden layers in the neural network. Traditional neural networks only contain 2-3 hidden layers, while deep networks can have as many as 150. Deep learning models are trained by using large sets of labeled data and neural network architectures that learn features directly from the data without the need for manual feature extraction.
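
As a small, hedged illustration of what “many hidden layers” looks like in practice (assuming the TensorFlow/Keras library is available; the input size, layer widths, and class count are arbitrary):

```python
# Minimal sketch of a network with several hidden layers in Keras.
# All shapes and sizes are illustrative, not taken from the text above.
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(784,)),                    # e.g. a flattened 28x28 image
    keras.layers.Dense(128, activation="relu"),   # hidden layer 1
    keras.layers.Dense(64, activation="relu"),    # hidden layer 2
    keras.layers.Dense(32, activation="relu"),    # hidden layer 3 -> "deep" (>1 hidden layer)
    keras.layers.Dense(10, activation="softmax"), # one output per class label
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
# Training then uses a large labelled set, e.g. model.fit(x_train, y_train, epochs=10)
```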

Figure 1: Neural networks, which are organized in layers consisting of a set of interconnected nodes

One of the most popular types of deep neural networks is known as convolutional neural networks (CNN or ConvNet). A CNN convolves learned features with input data and uses 2D convolutional layers, making this architecture well-suited to processing 2D data, such as images.
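
A CNN can be sketched in the same style (again assuming Keras is installed; the filter counts and image shape are illustrative):

```python
# Minimal CNN sketch: 2-D convolutional layers convolve learned filters with the
# input image before a dense layer classifies it. Shapes are illustrative.
from tensorflow import keras

cnn = keras.Sequential([
    keras.Input(shape=(28, 28, 1)),                      # a single-channel image
    keras.layers.Conv2D(16, (3, 3), activation="relu"),  # learn 16 small filters
    keras.layers.MaxPooling2D((2, 2)),
    keras.layers.Conv2D(32, (3, 3), activation="relu"),
    keras.layers.MaxPooling2D((2, 2)),
    keras.layers.Flatten(),
    keras.layers.Dense(10, activation="softmax"),
])
cnn.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```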

Difference between Machine Learning and Deep Learning

Deep learning is a specialized form of machine learning. A machine learning workflow starts with relevant features being manually extracted from images. The features are then used to create a model that categorizes the objects in the image. With a deep learning workflow, relevant features are automatically extracted from images. In addition, deep learning performs “end-to-end learning” – where a network is given raw data and a task to perform, such as classification, and it learns how to do this automatically.

Another key difference is deep learning algorithms scale with data, whereas shallow learning converges. Shallow learning refers to machine learning methods that plateau at a certain level of performance when you add more examples and training data to the network.

A key advantage of deep learning networks is that they often continue to improve as the size of your data increases.

Figure 2: Comparing a machine learning approach to categorizing vehicles (left) with deep learning (right)

In machine learning, you manually choose features and a classifier to sort images. With deep learning, the feature extraction and modeling steps are automatic.

References

[1] LeCun, Yann, Yoshua Bengio, and Geoffrey Hinton, “Deep learning”, Nature 521, Number 7553 (2015): pp. 436-444.

[2] “What Is Deep Learning? 3 things you need to know”, available online at: https://in.mathworks.com/discovery/deep-learning.html

[3] “Artificial Intelligence, Machine Learning, and Deep Learning”, available online at: https://deeplearning4j.org/ai-machinelearning-deeplearning

[4] Navdeep, “Ingestion and Processing of Data for Big Data and IoT Solutions”, March 03, 2017, available online at: https://www.xenonstack.com/blog/ingestion-processing-data-for-big-data-iot-solutions

What is Delay Tolerant Network (DTN)

Technology has shown significant potential in developing countries, as appropriate designs matched with real-world needs can effectively bridge information gaps, provide greater transparency, and improve communication efficiency. Unfortunately, many growing regions’ environments lack affordable network connectivity. Even where there is connectivity, networks are often characterized by frequent, lengthy, and unpredictable link outages, along with limited bandwidth and congested usage. Comparing the prevalence of access to information technology across regions, one finds marked distinctions between generally industrialized (and “wired”) countries and a large number of developing nations that lack connectivity and access to technology.

Overview

Communicating from Earth to any spacecraft is a complex challenge, largely due to the extreme distances involved. When data are transmitted and received across thousands and even millions of miles, the delay and the potential for disruption or data loss are significant. Over most networks, transmission is straightforward today, but it remains very difficult to transmit data in networks that suffer frequent delays and interruptions. Such interruption and delay are common for various reasons, such as changes in network topology or harsh environments. Over the last decades, researchers have proposed many solutions to these problems.

Though these mechanisms all try to address the issue within conventional network protocols, they are not feasible in some specific cases, which led to the concept of DTN.

What is it?

A delay tolerant network is a newly emerging kind of network, which usually deals with communications in extremely challenging environments, such as space communications, networking in sparsely populated areas, vehicular ad hoc networks, and underwater sensor networking. A Delay-Tolerant Network (DTN) is a network designed to operate effectively over extreme distances such as those encountered in space communications or on an interplanetary scale. In such an environment, long latency, sometimes measured in hours or days, is inevitable. However, similar problems can also occur over more modest distances when interference is extreme or network resources are severely overburdened.

Delay Tolerant Networking (DTN) is an approach to networking, which handles network disruptions and high delays that may occur in many kinds of communication networks. The primary reasons for high delay include partial connectivity of networks as can be seen in many types of ad hoc wireless networks with frequent network partitions, long propagation time as experienced in inter-planetary and deep space networks, and regular link disruptions due to the mobility of nodes as observed in terrestrial wireless network environments.

Figure 1: Example of Delay Tolerant Networking (DTN)

In a delay-tolerant network, traffic can be classified in three ways, called expedited, normal, and bulk in order of decreasing priority. Expedited packets are always transmitted, reassembled, and verified before data of any other class from a given source to a given destination. Normal traffic is sent after all expedited packets have been successfully assembled at their intended destination. Bulk traffic is not dealt with until all packets of other classes from the same source and bound for the same destination have been successfully transmitted and reassembled.
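
A toy sketch of that three-class priority scheme follows (purely illustrative; real DTN implementations handle priorities per source-destination pair and include reassembly checks, which are omitted here):

```python
# Toy sketch of the expedited/normal/bulk classes: bundles are queued with a
# priority and the highest-priority class is always transmitted first.
import heapq

PRIORITY = {"expedited": 0, "normal": 1, "bulk": 2}   # lower number = sent earlier
queue, counter = [], 0

def enqueue(traffic_class, payload):
    global counter
    heapq.heappush(queue, (PRIORITY[traffic_class], counter, payload))
    counter += 1                                      # keeps FIFO order within a class

for cls, data in [("bulk", "log archive"), ("expedited", "command"), ("normal", "telemetry")]:
    enqueue(cls, data)

while queue:
    _, _, payload = heapq.heappop(queue)
    print("transmit:", payload)        # command, then telemetry, then log archive
```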

The basic features of a Delay-Tolerant Network

A DTN has the following basic features:

  • Intermittent Connection: Because node mobility and energy are limited, DTN links frequently disconnect, resulting in continual change in the DTN topology. In other words, the network remains in a state of intermittent and partial connection, so there is no guarantee that an end-to-end route exists.
  • High Delay, Low Efficiency, and High Queuing Delay: End-to-end delay is the sum of the delays at each hop along the route, and each hop's delay consists of waiting time, queuing time, and transmission time (see the short sketch after this list). Because an intermittent connection can remain unreachable for a very long time, each hop's delay may be very high, leading to a low data rate and asymmetric up/down link data rates. In addition, queuing delay dominates end-to-end delay, and frequent fragmentation in a DTN makes queuing delay grow.
  • Limited Resources: A node's computing, communication, and storage capabilities are weaker than those of an ordinary computer because of constraints on price, volume, and power. In addition, limited storage space results in a higher packet loss rate.
  • Limited Node Lifetime: In restricted networks, nodes commonly run on battery power in hostile or harsh environments, which shortens their lifetime. When the power runs out, the node can no longer operate normally; it is quite possible for a node to lose power while a message is being transmitted.
  • Dynamic Topology: The DTN topology changes dynamically as nodes drop out of the network because of environmental changes, energy depletion, or other failures, and as new nodes join it.
  • Poor Security: In addition to the usual threats of wireless communication networks, a DTN is vulnerable to eavesdropping, message modification, routing spoofing, Denial of Service (DoS), and other security threats, owing to the lack of specialized security services and maintenance in the real world.
  • Heterogeneous Interconnection: A DTN is an overlay network for the transmission of asynchronous messages. By introducing the bundle layer, a DTN can run on top of different heterogeneous network protocol stacks, and the DTN gateway ensures reliable transmission of messages across the interconnection.
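
The delay model mentioned in the list above can be written out directly; in this sketch the per-hop numbers are invented purely for illustration:

```python
# End-to-end delay as described above: the sum, over every hop on the route, of
# waiting, queuing, and transmission time. All values are invented (in seconds).
hops = [
    {"waiting": 120.0, "queuing": 40.0, "transmission": 2.0},
    {"waiting": 600.0, "queuing": 95.0, "transmission": 3.5},   # a long disconnection
]
end_to_end_delay = sum(h["waiting"] + h["queuing"] + h["transmission"] for h in hops)
print(f"end-to-end delay: {end_to_end_delay:.1f} s")   # 860.5 s for these numbers
```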

How DTN Works

DTN is a computer networking model and a system of rules for transmitting information, often referred to as a protocol suite that extends the terrestrial Internet capabilities into the challenging communication environments in space where the conventional Internet does not work well. These environments are typically subject to frequent disruptions, links that are limited to one direction, possibly long delays, and high error rates.

The DTN protocol suite can operate in tandem with the terrestrial IP suite or it can operate independently. DTN provides assured delivery of data using automatic store-and-forward mechanisms. Each data packet that is received is forwarded immediately if possible, but stored for future transmission if forwarding is not currently possible but is expected to be possible in the future. As a result, only the next hop needs to be available when using DTN.
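
The store-and-forward behavior can be sketched as follows; the flag standing in for link availability and the bundle names are invented, and a real DTN node would rely on its convergence layer and persistent storage instead:

```python
# Toy store-and-forward sketch: forward a bundle at once if the next hop is
# reachable, otherwise store it until a contact becomes available again.
stored_bundles = []
link_available = False        # stands in for "is the next hop reachable right now?"

def handle(bundle):
    if link_available:
        print("forward now:", bundle)
    else:
        stored_bundles.append(bundle)        # keep it for a future contact
        print("stored for later:", bundle)

def on_contact_restored():
    global link_available
    link_available = True
    while stored_bundles:                    # flush stored bundles in arrival order
        print("forward now:", stored_bundles.pop(0))

handle("science data #1")     # link down -> stored
handle("science data #2")     # link down -> stored
on_contact_restored()         # contact restored -> both bundles forwarded
```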

The DTN suite also contains network management, security, routing, and quality-of-service capabilities, which are similar to the capabilities provided by the terrestrial Internet suite. Even though DTN was developed with space applications in mind, the benefits hold true for terrestrial applications where frequent disruptions and high-error rates are common. Some examples include disaster response and wireless sensor networks.

  • Improved Operations and Situational Awareness: The DTN store-and-forward mechanism along with automatic retransmission provides more insight into events during communication outages that occur as a result of relay or ground station handovers and poor atmospheric conditions, and significantly reduces the need to schedule ground stations to send or receive data, which can sometimes require up to five days of planning before transmission takes place.
  • Interoperability and Reuse: A standardized DTN protocol suite enables the interoperability of ground stations and spacecraft operated by any space agency or private entity with space assets. It also allows NASA to use the same communication protocols for future missions (low-Earth orbit, near-Earth orbit, or deep space).
  • Space Link Efficiency, Utilization, and Robustness: DTN enables more reliable and efficient data transmissions resulting in more usable bandwidth. DTN also improves link reliability by having multiple network paths and assets for potential communication hops.
  • Security: The DTN Bundle Protocol Security allows for integrity checks, authentication, and encryption, even on links where not previously used.
  • Quality-of-Service: The DTN protocol suite allows for many priority levels to be set for different data types, ensuring that the most important data is received ahead of less important data.

References

[1] Harminder Singh Bindra and Amrit Lal Sangal, “Considerations and Open Issues in Delay Tolerant Network’s (DTNs) Security”, Wireless Sensor Network, 2010, Volume 2, pp. 645-648

[2] Wei Suna, Congmin Liu and Dan Wang, “On Delay-Tolerant Networking and Its Application”, 2011 International Conference on Computer Science and Information Technology (ICCSIT 2011), IACSIT Press, Singapore

[3] “Disruption Tolerant networking”, available online at: https://www.nasa.gov/content/dtn

[4] “delay-tolerant network”, available online at: http://searchnetworking.techtarget.com/definition/delay-tolerant-network

[5] Demmer, Michael Joshua, “A delay tolerant networking and system architecture for developing regions”, PhD dissertation, University of California, Berkeley, 2008.

What is a Multi-Agent System

Multi-agent systems are made up of multiple interacting intelligent agents—computational entities to some degree autonomous and able to cooperate, compete, communicate, act flexibly, and exercise control over their behavior within the frame of their objectives. They are the enabling technology for a wide range of advanced applications relying on distributed and parallel processing of data, information, and knowledge relevant in domains ranging from industrial manufacturing to e-commerce to health care.

What is it?

In artificial intelligence research, agent-based systems technology has been hailed as a new paradigm for conceptualizing, designing, and implementing software systems. Agents are sophisticated computer programs that act autonomously on behalf of their users, across open and distributed environments, to solve a growing number of complex problems. Increasingly, however, applications require multiple agents that can work together. A multi-agent system (MAS) is a loosely coupled network of software agents that interact to solve problems that are beyond the individual capacities or knowledge of each problem solver. The multi-agent system can be defined by the following definition:

“A multi-agent system is a loosely coupled network of problem-solving entities (agents) that work together to find answers to problems that are beyond the individual capabilities or knowledge of each entity (agent)”.

The trend toward the development of increasingly intelligent systems is matched only by the trend toward the distribution of computing. The science of multiagent systems lies at the intersection of these trends. Multi-agent systems are of great significance in a number of current and future applications of computer science. For example, they arise in systems for electronic data interchange, air traffic control, manufacturing automation, computer-supported cooperative work, and electronic banking, as well as in robotics and heterogeneous information systems.

An agent is a computerized entity such as a computer program or a robot. An agent can be described as autonomous because it has the capacity to adapt when its environment changes. A multi-agent system is made up of a set of computer processes that occur at the same time, i.e. several agents that exist at the same time, share common resources, and communicate with each other. The key issue in multi-agent systems is to formalize the coordination between agents. Research on agents, therefore, includes research into:

  • Decision-making: what decision-making mechanisms are available to the agent? What is the link between their perceptions, representations, and actions?
  • Control: what hierarchic relationships exist between agents? How are they synchronized?
  • Communication: what kind of messages do they send each other? What syntax do these messages obey?
Multi-agent systems can be applied to artificial intelligence. They simplify problem-solving by dividing the necessary knowledge into subunits, associating an independent intelligent agent with each subunit, and coordinating the agents' activity. In this way we speak of distributed artificial intelligence. This method can be used, for example, for monitoring an industrial process, when the sensible solution, coordinating several specialized monitors rather than a single omniscient one, is adopted.

The fact that the agents within a MAS work together implies that some form of cooperation among the individual agents is involved. However, the concept of cooperation in a MAS is at best unclear and at worst highly inconsistent, so the terminology, possible classifications, and so on are even more problematic than in the case of individual agents, which makes any attempt to present MAS classification a hard problem. A typology of cooperation seems the simplest starting point, and we use it here as the basis for MAS classification. The typology is given in Figure 1.

Figure 1: Multi-Agent System Cooperation Typology

Advantages of a Multi-Agent Approach

A MAS has the following advantages over a single-agent or centralized approach:

  • A MAS distributes computational resources and capabilities across a network of interconnected agents. Whereas a centralized system may be plagued by resource limitations, performance bottlenecks, or critical failures, a MAS is decentralized and thus does not suffer from the “single point of failure” problem associated with centralized systems.
  • A MAS allows for the interconnection and interoperation of multiple existing legacy systems. By building an agent wrapper around such systems, they can be incorporated into an agent society.
  • A MAS models problems in terms of autonomous interacting component agents, which is proving to be a more natural way of representing task allocation, team planning, user preferences, open environments, and so on.
  • A MAS efficiently retrieves, filters, and globally coordinates information from sources that are spatially distributed.
  • A MAS provides solutions in situations where expertise is spatially and temporally distributed.
  • A MAS enhances overall system performance, specifically along the dimensions of computational efficiency, reliability, extensibility, robustness, maintainability, responsiveness, flexibility, and reuse.
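
To make the idea of loosely coupled, message-passing agents concrete, here is a minimal sketch; the agents, their skills, and the delegation rule are all invented for illustration:

```python
# Minimal sketch: agents that cooperate purely by exchanging messages. An agent
# handles a task itself if it can, and otherwise delegates it to a capable peer.
class Agent:
    def __init__(self, name, skills):
        self.name, self.skills, self.inbox = name, set(skills), []

    def receive(self, task):
        self.inbox.append(task)              # communication happens via messages only

    def step(self, others):
        while self.inbox:
            task = self.inbox.pop(0)
            if task in self.skills:
                print(f"{self.name} handles '{task}' itself")
            else:                            # beyond its individual capability:
                helper = next(a for a in others if task in a.skills)
                helper.receive(task)
                print(f"{self.name} delegates '{task}' to {helper.name}")

planner = Agent("planner", {"plan route"})
driver = Agent("driver", {"drive"})
planner.receive("plan route")
planner.receive("drive")
planner.step([driver])
driver.step([planner])
```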

References

[1] Mevludin Glavic, “Agents and multi-agent systems: a short introduction for power engineers”, Technical Report, May 2006.

[2] “Multi-Agent Systems”, available online at: http://cormas.cirad.fr/en/demarch/sma.htm

[3] “Multi-Agent Systems”, available online at: https://www.cs.cmu.edu/~softagents/multi.html

What is Human-Computer Interaction (HCI)

Utilizing computers has always raised the question of interfacing. The methods by which humans interact with computers have traveled a long way. The journey still continues: new designs of technologies and systems appear every day, and research in this area has grown very fast in the last few decades. The growth of the Human-Computer Interaction (HCI) field has not only been in the quality of interaction; the field has also branched out in different directions over its history. Instead of designing regular interfaces, the different research branches have focused on multimodality rather than unimodality, on intelligent adaptive interfaces rather than command/action-based ones, and, finally, on active rather than passive interfaces.

Overview

Human-Computer Interaction (HCI) involves the planning and design of the interaction between users and computers. These days, ever smaller devices are used to deliver technology. An important advantage of computer-vision-based interaction is the freedom it offers: the user can interact with the computer without wires or intermediary devices. Recently, user interfaces have been built to capture the motion of our hands; researchers have developed techniques to track hand and finger movements through a webcam to establish an interaction mechanism between the user and the computer.

Sometimes called Man-Machine Interaction or Interfacing, the concept of Human-Computer Interaction/Interfacing (HCI) arose with the emergence of the computer, or more generally the machine, itself. The reason is clear: most sophisticated machines are worthless unless they can be used properly by people. This basic argument presents the two main terms that should be considered in the design of HCI: functionality and usability [1].

One important HCI factor is that different users form different conceptions or mental models about their interactions and have different ways of learning and keeping knowledge and skills (different “cognitive styles” as in, for example, “left-brained” and “right-brained” people). In addition, cultural and national differences play a part. Another consideration in studying or designing HCI is that user interface technology changes rapidly, offering new interaction possibilities to which previous research findings may not apply. Finally, user preferences change as they gradually master new interfaces.

Figure 1: Field of Human-Computer Interaction

Significance

Human-computer interaction (HCI) is the study of how people design, implement, and use interactive computer systems and how computers affect individuals, organizations, and society. This encompasses not only ease of use but also new interaction techniques for supporting user tasks, providing better access to information, and creating more powerful forms of communication. It involves input and output devices and the interaction techniques that use them; how information is presented and requested; how the computer’s actions are controlled and monitored; all forms of help, documentation, and training; the tools used to design, build, test, and evaluate user interfaces; and the processes that developers follow when creating Interfaces.

The goal of Human-Computer Interaction

The goals of HCI are to produce usable and safe systems, as well as functional systems. Usability is concerned with making systems easy to learn and easy to use. In order to produce computer systems with good usability, developers must attempt to:

  • Understand the factors that determine how people use technology
  • Develop tools and techniques to enable the building of suitable systems
  • Achieve efficient, effective, and safe interaction
  • Put users first

Underlying the whole theme of HCI is the belief that people using a computer system should come first. Their needs, capabilities, and preferences for conducting various tasks should direct developers in the way that they design systems. People should not have to change themselves in order to fit in with the system; instead, the system should be designed to match their requirements.

References

[1] Fakhreddine Karray and Milad Alemzadeh, “Human-Computer Interaction: Overview on State of the Art”, International Journal on Smart Sensing and Intelligent Systems, Volume 1, Number 1, March 2008

[2] Kinjal N. Shah and Kirit R. Rathod, “A survey on Human-Computer Interaction Mechanism Using Finger Tracking”, International Journal of Computer Trends and Technology (IJCTT) – volume 7 number 3– Jan 2014

[3] “Chapter 1: Introduction”, available online at: http://shodhganga.inflibnet.ac.in/bitstream/10603/13990/6/06_chapter_1.pdf

What is Data Visualization and Applications

A picture is worth a thousand words – especially when we are trying to understand and discover insights from data. Visuals are especially helpful when we’re trying to find relationships among hundreds or thousands of variables to determine their relative importance – or if they are important at all. Regardless of how much data we have, one of the best ways to discern important relationships is through advanced analysis and high-performance data visualization. If sophisticated analyses can be performed quickly, even immediately, and results presented in ways that showcase patterns and allow querying and exploration, people across all levels in our organization can make faster, more effective decisions.

Definition

Data visualizations are surprisingly common in our everyday life, but they often appear in the form of well-known charts and graphs. A combination of multiple visualizations and bits of information is often referred to as infographics. Data visualizations can be used to discover unknown facts and trends. You may see visualizations in the form of line charts to display change over time. Bar and column charts are helpful when observing relationships and making comparisons. Pie charts are a great way to show parts of a whole. And maps are the best way to visually share geographical data.

“Data visualization is the presentation of quantitative information in a graphical form. In other words, data visualizations turn large and small datasets into visuals that are easier for the human brain to understand and process”.

Data visualization concerns the manipulation of sampled and computed data for comprehensive display. The goal of the data visualization is to bring to the user a deeper understanding of the data as well as the underlying physical laws and properties. Such visualization may be used to enlighten a physicist on the complex interaction between electrons, to guide the medical practitioner in a surgery situation, or simply to view the surface of a planet, which has never been seen by human eyes.

The important aspects of interactive visualization can be broken down into three categories:

Computation: the ability to speedily compute a visualization. This may include computing a polygonal approximation to an isosurface of a scalar function, computing a particle trace through a time-dependent vector field, or any action that requires extracting an abstract object or representation from the data being examined.

Display: the ability to quickly display the computed visualization. Display encompasses both computed visualizations as listed above and direct display methods such as volume visualization and ray tracing.

Querying: the ability to interactively probe a displayed visualization for the purpose of understanding, on a fine scale, what is being displayed on a coarser scale.

Importance of Data Visualization

Better Decision Making

Today more than ever, organizations are using data visualizations and data tools to ask better questions and make better decisions. Emerging computer technologies and new user-friendly software programs have made it easy to learn more about your company and make better data-driven business decisions. The strong emphasis on performance metrics, data dashboards, and Key Performance Indicators (KPIs) shows the importance of measuring and monitoring company data. Common quantitative information measured by businesses includes units of product sold, revenue by quarter, department expenses, employee statistics, and company market share.

  • Meaningful Storytelling: Data visualizations and information graphics (infographics) have become essential tools for today’s mainstream media. Data journalism is on the rise and journalists consistently rely on quality visualization tools to help them tell stories about the world around us. Many well-respected institutions have fully embraced data-driven news including The New York Times, The Guardian, The Washington Post, Scientific American, CNN, Bloomberg, The Huffington Post, and The Economist.
  • Data Literacy: Being able to understand and read data visualizations has become a necessary requirement for the 21st century. Because data visualization tools and resources have become readily available, more and more non-technical professionals are expected to be able to gather insights from data.

Data visualization, the use of images to represent information, is only now becoming properly appreciated for the benefits it can bring to business. It provides a powerful means both to make sense of data and to then communicate what we’ve discovered to others. Despite their potential, the benefits of data visualization are undermined today by a general lack of understanding. Many of the current trends in data visualization are actually producing the opposite of the intended effect, confusion rather than understanding. Nothing going on in the field of business intelligence today can bring us closer to fulfilling its promise of intelligence in the workplace than data visualization.

The Importance of Visualizations in Business

A visual can communicate more information than a table in a much smaller space. This trait of visuals makes them more effective than tables for presenting data. For example, notice the table below, and try to spot the month with the highest sales.

Month   Jan   Feb   Mar   Apr   May   Jun
Sales    45    56    36    58    75    62

This data when visualized gives you the same information in a second or two.

Figure 1: An example of data visualization
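
A chart like the one in Figure 1 takes only a few lines to produce; here is a minimal sketch using matplotlib (assuming the library is installed):

```python
# Minimal sketch: the monthly sales table above rendered as a column chart,
# where the May peak is visible at a glance.
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
sales = [45, 56, 36, 58, 75, 62]

plt.bar(months, sales)
plt.title("Monthly sales")
plt.ylabel("Sales")
plt.show()
```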

“Graphical excellence is that which gives to the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space.” This trait of visualizations is what makes them vital to businesses.

References

[1] Chandrajit Bajaj, “Data Visualization Techniques”, 1998 John Wiley & Sons Ltd

[2] “What is Data Visualization?” available online at: https://infogram.com/page/data-visualization

[3] “Principles of Data Visualization – What We See in a Visual”, White Paper, FusionCharts

[4] Stephen Few and Perceptual Edge, “Data Visualization Past, Present, and Future”, Innovation Center, Wednesday, January 10, 2007

What is Reinforcement Learning in Machine Learning

The idea that we learn by interacting with our environment is probably the first to occur to us when we think about the nature of learning. When an infant plays, waves its arms, or looks about, it has no explicit teacher, but it does have a direct sensorimotor connection to its environment. Reinforcement learning is a computational approach to understanding and automating goal-directed learning and decision-making. It is distinguished from other computational approaches by its emphasis on learning by an agent from direct interaction with its environment, without relying on exemplary supervision or complete models of the environment.

General Overview

Reinforcement learning is learning what to do, i.e. how to map situations to actions, so as to maximize a numerical reward signal. The learner is not told which actions to take, as in most forms of machine learning, but instead must discover which actions yield the most reward by trying them. In the most interesting and challenging cases, actions may affect not only the immediate reward but also the next situation and, through that, all subsequent rewards. These two characteristics, trial-and-error search and delayed reward, are the two most important distinguishing features of reinforcement learning. Reinforcement Learning is a type of Machine Learning, and thereby also a branch of Artificial Intelligence. It allows machines and software agents to automatically determine the ideal behavior within a specific context, in order to maximize their performance. Simple reward feedback is required for the agent to learn its behavior; this is known as the reinforcement signal.

Definition

Reinforcement learning is a learning paradigm concerned with learning to control a system so as to maximize a numerical performance measure that expresses a long-term objective. What distinguishes reinforcement learning from supervised learning is that only partial feedback is given to the learner about the learner’s predictions. Further, the predictions may have long-term effects by influencing the future state of the controlled system. Thus, time plays a special role. The goal of reinforcement learning is to develop efficient learning algorithms.

Figure 1: The Basic Reinforcement Learning Scenario

Reinforcement learning is defined not by characterizing learning methods, but by characterizing a learning problem. Any method that is well suited to solving that problem, we consider to be a reinforcement learning method. Reinforcement learning is different from supervised learning, the kind of learning studied in most current research in machine learning, statistical pattern recognition, and artificial neural networks. Supervised learning is learning from examples provided by a knowledgeable external supervisor. This is an important kind of learning, but alone it is not adequate for learning from interaction. In interactive problems, it is often impractical to obtain examples of desired behavior that are both correct and representative of all the situations in which the agent has to act. In uncharted territory, where one would expect learning to be most beneficial, an agent must be able to learn from its own experience.

Elements of Reinforcement Learning

Beyond the agent and the environment, one can identify four main sub-elements of a reinforcement learning system: a policy, a reward signal, a value function, and, optionally, a model of the environment.

A policy defines the learning agent’s way of behaving at a given time. Roughly speaking, a policy is a mapping from perceived states of the environment to actions to be taken when in those states. It corresponds to what in psychology would be called a set of stimulus-response rules or associations. In some cases, the policy may be a simple function or lookup table, whereas in others it may involve extensive computation such as a search process. The policy is the core of a reinforcement learning agent in the sense that it alone is sufficient to determine behavior. In general, policies may be stochastic.

A reward signal defines the goal in a reinforcement learning problem. On each time step, the environment sends to the reinforcement learning agent a single number, a reward. The agent’s sole objective is to maximize the total reward it receives over the long run. The reward signal thus defines what the good and bad events are for the agent. In a biological system, we might think of rewards as analogous to the experiences of pleasure or pain. They are the immediate and defining features of the problem faced by the agent. As such, the process that generates the reward signal must be unalterable by the agent. The agent can alter the signal that the process produces directly by its actions and indirectly by changing its environment’s state— since the reward signal depends on these—but it cannot change the function that generates the signal.

Whereas the reward signal indicates what is good in an immediate sense, a value function specifies what is good in the long run. Roughly speaking, the value of a state is the total amount of reward an agent can expect to accumulate over the future, starting from that state. Whereas rewards determine the immediate, intrinsic desirability of environmental states, values indicate the long-term desirability of states after taking into account the states that are likely to follow, and the rewards available in those states. For example, a state might always yield a low immediate reward but still have a high value because it is regularly followed by other states that yield high rewards. Or the reverse could be true. To make a human analogy, rewards are somewhat like pleasure (if high) and pain (if low), whereas values correspond to a more refined and farsighted judgment of how pleased or displeased we are that our environment is in a particular state. Expressed this way, we hope it is clear that value functions formalize a basic and familiar idea.

The fourth and final element of some reinforcement learning systems is a model of the environment. This is something that mimics the behavior of the environment, or more generally, that allows inferences to be made about how the environment will behave. For example, given a state and action, the model might predict the resultant next state and next reward. Models are used for planning, by which we mean any way of deciding on a course of action by considering possible future situations before they are actually experienced.
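
These elements can be seen working together in tabular Q-learning, one of the simplest reinforcement learning algorithms. The sketch below is not taken from the references; the environment is an invented five-state corridor in which only the rightmost state gives a reward, and all constants are illustrative:

```python
# Minimal tabular Q-learning sketch: the Q-table plays the value function, the
# epsilon-greedy rule plays the policy, and the corridor plays the environment.
import random

n_states, actions = 5, (+1, -1)            # actions: step right or step left
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
alpha, gamma, epsilon = 0.1, 0.9, 0.1      # learning rate, discount, exploration

for episode in range(300):
    s = 0
    while s != n_states - 1:               # the episode ends at the rewarding state
        a = random.choice(actions) if random.random() < epsilon \
            else max(actions, key=lambda act: Q[(s, act)])
        s_next = min(max(s + a, 0), n_states - 1)
        r = 1.0 if s_next == n_states - 1 else 0.0                # reward signal
        best_next = max(Q[(s_next, act)] for act in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])  # value update
        s = s_next

print({s: max(actions, key=lambda act: Q[(s, act)]) for s in range(n_states - 1)})
# learned policy: move right (+1) in every non-terminal state
```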

Applications

One reason that reinforcement learning is popular is that it serves as a theoretical tool for studying the principles of agents learning to act. But it is unsurprising that it has also been used by a number of researchers as a practical computational tool for constructing autonomous systems that improve themselves with experience. These applications have ranged from robotics to industrial manufacturing, to combinatorial search problems such as computer game playing. Some of the practical applications of reinforcement learning are:

Manufacturing: At Fanuc, a robot uses deep reinforcement learning to pick a device from one box and put it in a container. Whether it succeeds or fails, it memorizes the object and gains knowledge, training itself to do this job with great speed and precision.

Inventory Management: A major issue in supply chain inventory management is the coordination of inventory policies adopted by different supply chain actors, such as suppliers, manufacturers, and distributors, so as to smooth material flow and minimize costs while responsively meeting customer demand.

Delivery Management: Reinforcement learning is used to solve the problem of Split Delivery Vehicle Routing. Q-learning is used to serve appropriate customers with just one vehicle.

Power Systems: Reinforcement Learning and optimization techniques are utilized to assess the security of the electric power systems and to enhance Microgrid performance. Adaptive learning methods are employed to develop control and protection schemes. Transmission technologies with High-Voltage Direct Current (HVDC) and Flexible Alternating Current Transmission System devices (FACTS) based on adaptive learning techniques can effectively help to reduce transmission losses and CO2 emissions.

Finance Sector: AI is at the forefront of leveraging reinforcement learning for evaluating trading strategies. It is turning out to be a robust tool for training systems to optimize financial objectives. It has immense applications in stock market trading where the Q-Learning algorithm is able to learn an optimal trading strategy with one simple instruction.

References

[1] Szepesvári, Csaba, “Algorithms for reinforcement learning”, Synthesis lectures on artificial intelligence and machine learning 4, no. 1 (2010): 1-103.

[2] Sutton, Richard S., and Andrew G. Barto. Reinforcement learning: An introduction. Vol. 1, no. 1, Cambridge: MIT Press, 1998.

[3] Maruti Techlabs, “Reinforcement Learning and Its Practical Applications”, available online at: https://chatbotsmagazine.com/reinforcement-learning-and-its-practical-applications-8499e60cf751.

What is Natural Language Processing (NLP)

Natural language processing (NLP) is the relationship between computers and human language. More specifically, natural language processing is the computer understanding, analysis, manipulation, and/or generation of natural language. Will a computer program ever be able to convert a piece of English text into a programmer-friendly data structure that describes the meaning of the natural language text? Unfortunately, no consensus has emerged about the form or the existence of such a data structure. Until such fundamental Artificial Intelligence problems are resolved, computer scientists must settle for the reduced objective of extracting simpler representations that describe limited aspects of the textual information.

Overview

Natural language processing (NLP) can be defined as the automatic (or semi-automatic) processing of human language. The term ‘NLP’ is sometimes used rather more narrowly than that, often excluding information retrieval and sometimes even excluding machine translation. NLP is sometimes contrasted with ‘computational linguistics’, with NLP being thought of as more applied. Nowadays, alternative terms are often preferred, such as ‘Language Technology’ or ‘Language Engineering’. Language is often used in contrast with speech (e.g., Speech and Language Technology); here, however, the term NLP is used broadly. NLP is essentially multidisciplinary: it is closely related to linguistics, although the extent to which NLP overtly draws on linguistic theory varies considerably.

What is it?

NLP is a way for computers to analyze, understand, and derive meaning from human language in a smart and useful way. By utilizing NLP, developers can organize and structure knowledge to perform tasks such as automatic summarization, machine translation, named entity recognition, relationship extraction, sentiment analysis, speech recognition, parts-of-speech tagging, stemming, topic extraction, and topic segmentation. Analyzing text in this way allows machines to understand how humans communicate, which is what enables these real-world human-computer applications. NLP is commonly used for text mining, machine translation, and automated question answering.
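As a minimal sketch of a few of these tasks, the snippet below runs tokenization, parts-of-speech tagging, and named entity recognition with the NLTK toolkit. It assumes NLTK is installed; the example sentence is made up, and other toolkits (e.g., spaCy) would work equally well.

    import nltk

    # Sketch of tokenization, POS tagging, and named entity recognition with NLTK.
    # The download() calls fetch the required resources on first use; exact resource
    # names can vary slightly between NLTK versions.
    for resource in ("punkt", "averaged_perceptron_tagger", "maxent_ne_chunker", "words"):
        nltk.download(resource, quiet=True)

    text = "Google Translate applies machine translation to whole sentences."

    tokens = nltk.word_tokenize(text)     # split the text into word and punctuation tokens
    tagged = nltk.pos_tag(tokens)         # parts-of-speech tagging
    entities = nltk.ne_chunk(tagged)      # named entity recognition over the tagged tokens

    print(tagged[:4])
    print(entities)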

Figure 1: NLP Techniques

Importance of NLP

Earlier approaches to NLP involved a more rules-based approach, where simpler machine learning algorithms were told what words and phrases to look for in text and given specific responses when those phrases appeared. But deep learning is a more flexible, intuitive approach in which algorithms learn to identify speakers’ intent from many examples, almost like how a child would learn human language.

The advantage of natural language processing can be seen when considering the following two statements: “Cloud computing insurance should be part of every service level agreement” and “A good SLA ensures an easier night’s sleep — even in the cloud.” If you use natural language processing for search, the program will recognize that cloud computing is an entity, that cloud is an abbreviated form of cloud computing, and that SLA is an industry acronym for service level agreement.

Some Linguistic Terminology

The subareas loosely correspond to some of the standard subdivisions of linguistics:

  • Morphology: the structure of words. For instance, unusually can be thought of as composed of a prefix un-, a stem usual, and an affix -ly. Composed is compose plus the inflectional affix -ed: a spelling rule means we end up with composed rather than composeed. (A toy segmentation sketch follows this list.)
  • Syntax: the way words are used to form phrases. For example, it is part of English syntax that a determiner such as the will come before a noun, and also that determiners are obligatory with certain singular nouns.
  • Semantics: Compositional semantics is the construction of meaning (generally expressed as logic) based on syntax. This is contrasted to lexical semantics, i.e., the meaning of individual words.
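The following toy sketch illustrates the idea of morphological segmentation by stripping a few hard-coded affixes; the affix lists are illustrative assumptions, and a real morphological analyser is far more sophisticated.

    # Toy illustration of morphological segmentation; the affix lists are
    # illustrative and far from a real morphological analyser.
    PREFIXES = ("un", "re", "dis")
    SUFFIXES = ("ly", "ed", "ing")

    def segment(word):
        prefix = next((p for p in PREFIXES if word.startswith(p)), "")
        rest = word[len(prefix):]
        suffix = next((s for s in SUFFIXES if rest.endswith(s)), "")
        stem = rest[: len(rest) - len(suffix)] if suffix else rest
        return prefix, stem, suffix

    print(segment("unusually"))   # -> ('un', 'usual', 'ly')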

Application

Here are a few common ways NLP is being used today:

  • Spell check functionality in Microsoft Word is the most basic and well-known application.
  • Text analysis, also known as sentiment analytics, is a key use of NLP. Businesses can use it to learn how their customers feel emotionally and use that data to improve their service.
  • By using email filters to analyze the emails that flow through their servers, email providers can use Naive Bayes spam filtering to calculate the likelihood that an email is spam based on its content (a minimal sketch appears after this list).
  • Call center representatives often hear the same, specific complaints, questions, and problems from customers. Mining this data for sentiment can produce incredibly actionable intelligence that can be applied to product placement, messaging, design, or a range of other uses.
  • Google, Bing, and other search systems use NLP to extract terms from text to populate their indexes and parse search queries.
  • Google Translate applies machine translation technologies in not only translating words, but also in understanding the meaning of sentences to improve translations.
  • Financial markets use NLP by taking plain-text announcements and extracting the relevant info in a format that can be factored into making algorithmic trading decisions. For example, news of a merger between companies can have a big impact on trading decisions, and the speed at which the particulars of the merger (e.g., players, prices, who acquires who) can be incorporated into a trading algorithm can have profit implications in the millions of dollars.
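Here is a minimal Naive Bayes spam-filter sketch using scikit-learn, as referenced in the list above. The messages and labels are made up for illustration; a real filter would be trained on a large labeled corpus.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    # Tiny Naive Bayes spam-filter sketch; the messages and labels are made up.
    messages = [
        "Win a free prize now", "Cheap meds, limited time offer",
        "Meeting moved to 3pm", "Lunch tomorrow?",
    ]
    labels = [1, 1, 0, 0]                      # 1 = spam, 0 = not spam

    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(messages)     # bag-of-words counts
    clf = MultinomialNB().fit(X, labels)

    test = vectorizer.transform(["Free prize waiting for you"])
    print(clf.predict_proba(test))             # [P(not spam), P(spam)]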

A Few NLP Examples

  • Use Summarizer to automatically summarize a block of text, extracting topic sentences and ignoring the rest.
  • Generate keyword topic tags from a document using LDA (Latent Dirichlet Allocation), which determines the most relevant words from a document. This algorithm is at the heart of the Auto-Tag and Auto-Tag URL micro-services (see the sketch after this list).
  • Sentiment Analysis, based on Stanford NLP, can be used to identify the feeling, opinion, or belief of a statement, from very negative, to neutral, to very positive.
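As a small illustration of LDA-based topic tagging, the sketch below uses scikit-learn's LatentDirichletAllocation on a handful of made-up documents; the documents, topic count, and library choice are all assumptions for demonstration.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    # Sketch of LDA topic extraction; the documents and topic count are made up.
    docs = [
        "the stock market fell as traders sold shares",
        "the team won the match after a late goal",
        "investors worry about interest rates and inflation",
        "the striker scored twice in the final game",
    ]

    vectorizer = CountVectorizer(stop_words="english")
    X = vectorizer.fit_transform(docs)

    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
    words = vectorizer.get_feature_names_out()
    for i, topic in enumerate(lda.components_):
        top_words = [words[j] for j in topic.argsort()[-4:]]    # 4 strongest words per topic
        print(f"Topic {i}: {top_words}")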

References

[1] Ann Copestake, “Natural Language Processing”, 2004, 8 Lectures, available online at: https://www.cl.cam.ac.uk/teaching/2002/NatLangProc/revised.pdf

[2] Ronan Collobert and Jason Weston, “Natural Language Processing (Almost) from Scratch”, Journal of Machine Learning Research 12 (2011) pp. 2493-2537

[3] “Top 5 Semantic Technology Trends to look for in 2017”, available online at: https://ontotext.com/top-5-semantic-technology-trends-2017/

What is Data Preprocessing

Data analysis is now integral to our working lives. It is the basis for investigations in many fields of knowledge, from science to engineering and from management to process control. Data on a particular topic are acquired in the form of symbolic and numeric attributes, and analysis of these data gives a better understanding of the phenomenon of interest. When the development of a knowledge-based system is planned, data analysis involves the discovery and generation of new knowledge for building a reliable and comprehensive knowledge base. Exploratory data analysis and predictive analytics can be used to extract hidden patterns from data and are becoming increasingly important tools for transforming data into information. However, real-world data is generally incomplete and noisy and is likely to contain irrelevant and redundant information or errors; data preprocessing helps transform such raw data into an understandable format.

Data pre-processing is an essential step in the data mining process. It describes any type of processing performed on raw data to prepare it for another processing procedure. Data preprocessing transforms the data into a format that will be more efficiently and effectively processed for the user. Data pre-processing is a step of the Knowledge discovery in databases (KDD) process that reduces the complexity of the data and offers better conditions for subsequent analysis. Through this, the nature of the data is better understood and the data analysis is performed more accurately and efficiently. Data preprocessing is used in database-driven applications such as customer relationship management and rule-based applications (like neural networks).

Importance of Data Pre-processing

Data have quality if they satisfy the requirements of the intended use. Many factors comprise data quality, including accuracy, completeness, consistency, timeliness, believability, and interpretability. Real-world data is usually incomplete (it may contain missing values), noisy (it may contain errors introduced during transmission, or dirty data), and inconsistent (it may contain duplicate or unexpected values). Data preprocessing is a proven method of solving such problems.

No quality data, no quality mining results: if the analysis is performed on low-quality data, the results obtained will also be of low quality, which is not acceptable in the decision-making process. For a quality result, this dirty data must first be cleaned, and data pre-processing techniques are what convert dirty data into quality data.

Techniques of Data Pre-processing

We look at the major steps involved in data preprocessing, namely, data cleaning, data integration, data reduction, and data transformation.

Figure 1: Techniques of Data Pre-processing

Data Cleaning

Data cleaning routines work to “clean” the data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies. If users believe the data are dirty, they are unlikely to trust the results of any data mining that has been applied. Furthermore, dirty data can cause confusion in the mining procedure, resulting in unreliable output. Although most mining routines have some procedures for dealing with incomplete or noisy data, they are not always robust. Instead, they may concentrate on avoiding over-fitting the data to the function being modeled. Therefore, a useful preprocessing step is to run your data through some data-cleaning routines.

Data cleaning or data cleansing techniques attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data. Data cleansing or data cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database.

Tasks in Data Cleaning:

  • Fill in missing values
  • Identify outliers and smooth noisy data
  • Correct inconsistent data

Fill in Missing Values:

  • Ignore the tuple
  • Fill in the missing values manually
  • Use a global constant to fill in the missing value.
  • Use the most probable value
  • Use the attribute mean or median for all the samples belonging to the same class as the given tuple
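The sketch below shows a few of these strategies using pandas; the small data frame, column names, and missing values are made-up examples.

    import numpy as np
    import pandas as pd

    # Sketch of common missing-value strategies with pandas; the data is made up.
    df = pd.DataFrame({
        "age":    [25, np.nan, 47, 51, np.nan],
        "income": [40000, 52000, np.nan, 61000, 58000],
    })

    df_drop   = df.dropna()                               # ignore (drop) incomplete tuples
    df_const  = df.fillna(0)                              # fill with a global constant
    df_mean   = df.fillna(df.mean(numeric_only=True))     # fill with the attribute mean
    df_median = df.fillna(df.median(numeric_only=True))   # fill with the attribute median

    print(df_mean)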

Identify outliers and Smooth Noisy Data:

  • Binning
  • Regression
  • Outlier analysis.
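As a brief illustration of smoothing by binning and a simple outlier check, the sketch below uses pandas; the price values are a small illustrative series, and the three-standard-deviation rule is just one possible outlier criterion.

    import pandas as pd

    # Sketch of smoothing by binning and a simple outlier check; values are illustrative.
    prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

    bins = pd.qcut(prices, q=3)                         # equal-frequency (equi-depth) bins
    smoothed = prices.groupby(bins).transform("mean")   # replace each value by its bin mean
    print(pd.DataFrame({"price": prices, "bin": bins, "smoothed": smoothed}))

    # One possible outlier criterion: more than 3 standard deviations from the mean
    z_scores = (prices - prices.mean()) / prices.std()
    print(prices[z_scores.abs() > 3])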

Data Integration

Data mining often requires data integration, the merging of data from multiple data stores. Careful integration can help reduce and avoid redundancies and inconsistencies in the resulting data set, which in turn improves the accuracy and speed of the subsequent data mining process. Data integration is a data preprocessing technique that merges data from multiple heterogeneous data sources into a coherent data store. Because the merged data may be inconsistent, data integration often requires data cleaning. Data integration is thus the process of combining data from multiple sources to provide a single view over all of them, and it can be physical or virtual.

Tasks in Data Integration:

  • Data integration: combine data from multiple sources into a single data store.
  • Schema integration: integrate metadata from different sources.
  • Entity identification problem: identify the same real-world entities across multiple data sources.
  • Detecting and resolving data value conflicts: for the same real-world entity, attribute values from different sources may differ.
  • Handling redundancy in data integration.
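A minimal sketch of these tasks with pandas follows; the two tables, the differing key names, and the join strategy are all illustrative assumptions.

    import pandas as pd

    # Sketch of integrating two sources with pandas; table contents are made up.
    customers = pd.DataFrame({
        "cust_id": [1, 2, 3],
        "name":    ["Ana", "Ben", "Chen"],
    })
    orders = pd.DataFrame({
        "customer": [1, 1, 3],        # same real-world entity, different attribute name
        "amount":   [120.0, 80.5, 42.0],
    })

    # Entity identification: map the differing key names onto one schema, then merge.
    merged = customers.merge(orders, left_on="cust_id", right_on="customer", how="left")
    merged = merged.drop(columns="customer").drop_duplicates()   # handle redundancy
    print(merged)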

Data Transformation

Data transformation is the process of converting data from one format or structure into another format or structure. In this preprocessing step, the data are transformed or consolidated so that the resulting mining process may be more efficient, and the patterns found may be easier to understand.

In data transformation, the data are transformed or consolidated into forms appropriate for mining. Strategies for data transformation include the following:

  • Smoothing, which works to remove noise from the data. Techniques include binning, regression, and clustering.
  • Attribute construction (or feature construction), where new attributes are constructed and added from the given set of attributes to help the mining process.
  • Aggregation, where summary or aggregation operations are applied to the data. For example, the daily sales data may be aggregated so as to compute monthly and annual total amounts. This step is typically used in constructing a data cube for data analysis at multiple abstraction levels.
  • Normalization, where the attribute data are scaled so as to fall within a smaller range, such as −1.0 to 1.0, or 0.0 to 1.0 (a brief sketch follows this list).
  • Discretization, where the raw values of a numeric attribute (e.g., age) are replaced by interval labels (e.g., 0–10, 11–20, etc.) or conceptual labels (e.g., youth, adult, senior). The labels, in turn, can be recursively organized into higher-level concepts, resulting in a concept hierarchy for the numeric attribute.
  • Concept hierarchy generation for nominal data, where attributes such as street can be generalized to higher-level concepts, like city or country. Many hierarchies for nominal attributes are implicit within the database schema and can be automatically defined at the schema definition level.
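The sketch below illustrates two of these strategies, min-max normalization and discretization into conceptual labels, using pandas on a made-up age attribute; the values, bin boundaries, and labels are assumptions for demonstration.

    import pandas as pd

    # Sketch of min-max normalization and discretization on a made-up 'age' attribute.
    ages = pd.Series([13, 15, 22, 35, 48, 63, 70])

    # Min-max normalization: scale the values into the range 0.0 to 1.0
    normalized = (ages - ages.min()) / (ages.max() - ages.min())

    # Discretization: replace raw ages with conceptual labels (assumed bin boundaries)
    labels = pd.cut(ages, bins=[0, 20, 60, 120], labels=["youth", "adult", "senior"])

    print(pd.DataFrame({"age": ages, "normalized": normalized, "label": labels}))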

Data Reduction

A database or data warehouse may store terabytes of data, and performing complex analysis on such voluminous data may take a very long time on the complete data set. Data reduction is therefore used to obtain a reduced representation of the data set that is much smaller in volume yet produces the same analytical results. Data reduction is the transformation of numerical or alphabetical digital information derived empirically or experimentally into a corrected, ordered, and simplified form.

Data reduction Strategies:

  • Data Compression
  • Dimensionality reduction
  • Discretization and concept hierarchy generation
  • Numerosity reduction
  • Data cube aggregation.
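As a small example of dimensionality reduction, the sketch below applies PCA from scikit-learn to a random stand-in data matrix; the data, shapes, and number of retained components are illustrative assumptions rather than a real data set.

    import numpy as np
    from sklearn.decomposition import PCA

    # Sketch of dimensionality reduction with PCA; the data matrix is random
    # stand-in data rather than a real data set.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 10))          # 100 tuples, 10 attributes

    pca = PCA(n_components=3)               # keep 3 principal components
    X_reduced = pca.fit_transform(X)

    print(X_reduced.shape)                  # (100, 3)
    print(pca.explained_variance_ratio_)    # variance retained by each component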

References

[1] Tomar, Divya, and Sonali Agarwal, “A survey on pre-processing and post-processing techniques in data mining”, International Journal of Database Theory and Application 7, no. 4 (2014): pp. 99-128.

[2] Bilquees Bhagat, “Data pre-processing techniques in data mining”, September 2, 2017, available online at: https://cloudera2017.wordpress.com/2017/09/02/1182/

[3] “Data Preprocessing”, available online at: http://www.comp.dit.ie/btierney/BSI/Han%20Book%20Ch3%20DataExploration.pdf
