Efficient Joins Between Data Stored in Hdfs and Enterprise Warehouse Paper Reviews

Part I: What is an information system?

Chapter 4: Data and Databases

Upon successful completion of this chapter, you will be able to:

  • Describe the differences between data, information, and knowledge;
  • Describe why database technology must be used for data resources management;
  • Define the term database and identify the steps to creating one;
  • Describe the role of a database management system;
  • Describe the characteristics of a data warehouse; and
  • Define data mining and describe its role in an organization.

Introduction

You have already been introduced to the first two components of information systems: hardware and software. However, those two components by themselves do not make a computer useful. Imagine if you turned on a computer, started the word processor, but could not save a document. Imagine if you opened a music player but there was no music to play. Imagine opening a web browser but there were no web pages. Without data, hardware and software are not very useful! Data is the third component of an information system.

Data, Information, and Knowledge

There have been many definitions and theories about data, information, and knowledge. The three terms are often used interchangeably, although they are distinct in nature. We define and illustrate the three terms from the perspective of information systems.

Data are the raw facts, and may be devoid of context or intent. For example, a sales order of computers is a piece of data. Data can be quantitative or qualitative. Quantitative data is numeric, the result of a measurement, count, or some other mathematical calculation. Qualitative data is descriptive. "Carmine Red," the color of a 2013 Ford Focus, is an example of qualitative data. A number can be qualitative too: if I tell you my favorite number is 5, that is qualitative data because it is descriptive, not the result of a measurement or mathematical calculation.

Information is processed data that possesses context, relevance, and purpose. For example, monthly sales calculated from the collected daily sales data for the past year are information. Information typically involves the manipulation of raw data to obtain an indication of magnitude, trends, or patterns in the data for a purpose.
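The step from data to information can be sketched in a few lines of Python. This is only an illustration: the daily sales figures and the function name `monthly_totals` are assumptions made for the example, not part of any real system.

```python
from collections import defaultdict

# Hypothetical daily sales records (date, amount): raw data with little context.
daily_sales = [
    ("2023-01-05", 1200.0),
    ("2023-01-20", 800.0),
    ("2023-02-03", 950.0),
    ("2023-02-17", 1050.0),
]

def monthly_totals(records):
    """Aggregate daily amounts into totals keyed by 'YYYY-MM'."""
    totals = defaultdict(float)
    for date, amount in records:
        totals[date[:7]] += amount  # the first 7 characters hold the year and month
    return dict(totals)

# The monthly totals are information: the data now carries context and purpose.
monthly = monthly_totals(daily_sales)
print(monthly)
```

Comparing the monthly totals with each other, or with last year's, is what begins to reveal magnitude and trends.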

Knowledge in a certain area is human beliefs or perceptions about relationships among facts or concepts relevant to that area. For example, the conceived relationship between the quality of goods and the sales is knowledge. Knowledge can be viewed as information that facilitates action.

Once we have put our data into context, aggregated and analyzed it, we can use it to make decisions for our organization. We can say that this consumption of information produces knowledge. This knowledge can be used to make decisions, set policies, and even spark innovation.

Explicit knowledge typically refers to knowledge that can be expressed in words or numbers. In contrast, tacit knowledge includes insights and intuitions, and is hard to transfer to another person by means of simple communication.

Evidently, when information or explicit knowledge is captured and stored in a computer, it becomes data again if its context or intent is lost.

The final step up the information ladder is the step from knowledge (knowing a lot about a topic) to wisdom. We can say that someone has wisdom when they can combine their knowledge and experience to produce a deeper understanding of a topic. It often takes many years to develop wisdom on a particular topic, and requires patience.

Big Data

Almost all software programs require data to do anything useful. For example, if you are editing a document in a word processor such as Microsoft Word, the document you are working on is the data. The word-processing software can manipulate the data: create a new document, duplicate a document, or modify a document. Some other examples of data are: an MP3 music file, a video file, a spreadsheet, a web page, a social media post, and an e-book.

Recently, big data has been capturing the attention of all types of organizations. The term refers to data sets so massive that conventional data processing technologies do not have sufficient power to analyze them. For example, Walmart must process millions of customer transactions every hour across the world. Storing and analyzing that much data is beyond the power of traditional data management tools. Understanding and developing the best tools and techniques to manage and analyze these large data sets is a problem that governments and businesses alike are trying to solve.

Databases

The goal of many information systems is to transform data into information in order to generate knowledge that can be used for decision making. In order to do this, the system must be able to take data, allow the user to put the data into context, and provide tools for aggregation and analysis. A database is designed for just such a purpose.

Why Databases?

Data is a valuable resource in the organization. However, many people do not know much about database technology, and instead use non-database tools, such as Excel spreadsheets or Word documents, to store and manipulate business data, or use poorly designed databases for business processes. As a result, the data become redundant, inconsistent, inaccurate, and corrupted. For a small data set, the use of non-database tools such as spreadsheets may not cause serious problems. However, for a large organization, corrupted data could lead to serious errors and destructive consequences. The common defects in data resources management are explained as follows.

(1) No control of redundant data

People often keep redundant data for convenience. Redundant data could make the data set inconsistent. We use an illustrative example to explain why redundant data are harmful. Suppose the registrar's office has two separate files that store student data: one is the registered student roster, which records all students who have registered and paid tuition, and the other is the student grade roster, which records all students who have received grades.

Example of redundant data

As you can see from the two spreadsheets, this data management system has problems. The fact that "Student 4567 is Mary Brown, and her major is Finance" is stored more than once. Such occurrences are called data redundancy. Redundant data often make data access convenient, but can be harmful. For example, if Mary Brown changes her name or her major, then every copy of her name and major stored in the system must be changed together. For small data systems, such a problem looks trivial. However, when the data system is huge, making changes to all redundant data is difficult if not impossible. As a result of data redundancy, the entire data set can be corrupted.

(2) Violation of data integrity

Data integrity means consistency among the stored data. We use the above illustrative example to explain the concept of data integrity and how data integrity can be violated if the data system is flawed. You can find that Alex Wilson received a grade in MKT211; however, you cannot find Alex Wilson in the student roster. That is, the two rosters are not consistent. Suppose we have a data integrity control that enforces a rule such as "no student can receive a grade unless she/he has registered and paid tuition"; then such a violation of data integrity can never happen.
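A database can enforce such a rule automatically. The following is a minimal sketch using Python's built-in `sqlite3` module; the table names `roster` and `grade`, the grade values, and the student ID 9999 are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.execute("CREATE TABLE roster (student_id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE grade ("
    " student_id INTEGER NOT NULL REFERENCES roster(student_id),"
    " course TEXT NOT NULL,"
    " grade TEXT NOT NULL)"
)
conn.execute("INSERT INTO roster VALUES (4567, 'Mary Brown')")
conn.execute("INSERT INTO grade VALUES (4567, 'MKT211', 'A')")  # registered: accepted

def try_insert_grade(student_id, course, grade):
    """Return True if the grade row is accepted, False if the integrity rule rejects it."""
    try:
        conn.execute("INSERT INTO grade VALUES (?, ?, ?)", (student_id, course, grade))
        return True
    except sqlite3.IntegrityError:
        return False

# Student 9999 (hypothetical) is not on the roster, so the database refuses the grade.
rejected = not try_insert_grade(9999, "MKT211", "B")
print(rejected)
```

Because the rule lives in the database rather than in someone's memory, every program that writes grades is subject to it.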

(3) Relying on human memory to store and to search needed data

The third common flaw in data resource management is the overuse of human memory for data search. A human can remember what data are stored and where the data are stored, but can also make mistakes. If a piece of data is stored in an un-remembered place, it has effectively been lost. As a result of relying on human memory to store and to search needed data, the entire data set eventually becomes disorganized.

To avoid the above common flaws in data resource management, database technology must be applied. A database is an organized collection of related data. It is an organized collection because, in a database, all data is described and associated with other data. For the purposes of this text, we will only consider computerized databases.

Though not good for replacing databases, spreadsheets can be ideal tools for analyzing the data stored in a database. A spreadsheet package can be connected to a specific table or query in a database and used to create charts or perform analysis on that data.

Data Models and Relational Databases

Databases can be organized in many different ways by using different models. The data model of a database is the logical structure of data items and their relationships. There have been several data models. Since the 1980s, the relational data model has been popular. Currently, relational database systems are commonly used in business organizations, with few exceptions. A relational data model is easy to understand and use.

In a relational database, data is organized into tables (or relations). Each table has a set of fields which define the structure of the data stored in the table. A record is one instance of a set of fields in a table. To visualize this, think of the records as the rows (or tuples) of the table and the fields as the columns of the table.

In the example below, we have a table of student data, with each row representing a student record, and each column representing one field of the student record. A special field or a combination of fields that uniquely determines a record is called the primary key (or key). A key is usually the unique identification number of the records.

Rows and columns in a table

Designing a Database

Suppose a university wants to create a School Database to track data. After interviewing several people, the design team learns that the goal of implementing the system is to give better insight into students' performance and academic resources. From this, the team decides that the system must keep track of the students, their grades, courses, and classrooms. Using this information, the design team determines that the following tables need to be created:

  • STUDENT: student name, major, and e-mail.
  • COURSE: course title, enrollment capacity.
  • GRADE: this table will correlate STUDENT with COURSE, allowing us to have any given student enroll in multiple courses and receive a grade for each course.
  • CLASSROOM: classroom location, classroom type, and classroom capacity.

Now that the design team has determined which tables to create, they need to define the specific data items that each table will hold. This requires identifying the fields that will be in each table. For example, course title would be one of the fields in the COURSE table. Finally, since this will be a relational database, every table should have a field in common with at least one other table (in other words, they should have relationships with each other).

A primary key must be selected for each table in a relational database. This key is a unique identifier for each record in the table. For example, in the STUDENT table, it might be possible to use the student name as a way to identify a student. However, it is more than likely that some students share the same name. A student's e-mail address might be a good choice for a primary key, since e-mail addresses are unique. However, a primary key cannot change, so this would mean that if students changed their e-mail address we would have to remove them from the database and then re-insert them, which is not an attractive proposition. Our solution is to use the student ID as the primary key of the STUDENT table. We will also do this for the COURSE table and the CLASSROOM table. This solution is quite common and is the reason you have so many IDs! The primary key of a table can be just one field, but can also be a combination of two or more fields. For example, the combination of StudentID and CourseID in the GRADE table can be the primary key of the GRADE table, which means that a grade is received by a particular student for a specific course.

The next step in designing the database is to identify and make the relationships between the tables so that you can pull the data together in meaningful ways. A relationship between two tables is implemented by using a foreign key. A foreign key is a field in one table that connects to the primary key data in the original table. For example, ClassroomID in the COURSE table is the foreign key that connects to the primary key ClassroomID in the CLASSROOM table. With this design, not only do we have a way to organize all of the data we need and have successfully related all the tables together to meet the requirements, but we have also prevented invalid data from being entered into the database. You can see the final database design in the figure below:

Tables of the student database
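The design described above can be sketched as SQL table definitions, run here through Python's built-in `sqlite3` module. Field names not mentioned in the text (for example `StudentEmail`, `Location`, and `ClassroomType`) and the placement of `ClassroomID` in the COURSE table are assumptions made for this illustration.

```python
import sqlite3

SCHEMA = """
CREATE TABLE CLASSROOM (
    ClassroomID   INTEGER PRIMARY KEY,
    Location      TEXT,
    ClassroomType TEXT,
    Capacity      INTEGER
);
CREATE TABLE STUDENT (
    StudentID    INTEGER PRIMARY KEY,
    StudentName  TEXT NOT NULL,
    StudentMajor TEXT,
    StudentEmail TEXT
);
CREATE TABLE COURSE (
    CourseID           INTEGER PRIMARY KEY,
    CourseTitle        TEXT NOT NULL,
    EnrollmentCapacity INTEGER,
    ClassroomID        INTEGER REFERENCES CLASSROOM(ClassroomID)  -- foreign key
);
CREATE TABLE GRADE (
    StudentID INTEGER REFERENCES STUDENT(StudentID),  -- foreign key
    CourseID  INTEGER REFERENCES COURSE(CourseID),    -- foreign key
    Grade     TEXT,
    PRIMARY KEY (StudentID, CourseID)  -- composite primary key
);
"""

def create_school_db():
    """Create an in-memory School database with the four tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled
    conn.executescript(SCHEMA)
    return conn

conn = create_school_db()
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(tables)
```

Each `REFERENCES` clause is one relationship in the design, and the composite key on GRADE allows at most one grade per student per course.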

Normalization

When designing a database, one important concept to understand is normalization. In simple terms, to normalize a database means to design it in a way that: 1) reduces data redundancy; and 2) ensures data integrity.

In the School Database design, the design team worked to achieve these objectives. For example, to track grades, a simple (and incorrect) solution might have been to create a Student field in the COURSE table and then just list the names of all of the students there. However, this design would mean that if a student takes two or more courses, then his or her data would have to be entered two or more times. This means the data are redundant. Instead, the designers solved this problem by introducing the GRADE table.

In this design, when a student registers in the school system before taking a course, we first must add the student to the STUDENT table, where their ID, name, major, and e-mail address are entered. Now we will add a new entry to denote that the student takes a specific course. This is accomplished by adding a record with the StudentID and the CourseID to the GRADE table. If this student takes a second course, we do not have to duplicate the entry of the student's name, major, and e-mail; instead, we only need to make another entry in the GRADE table with the second course's ID and the student's ID.
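A short sketch of the effect, again using Python's `sqlite3` module: the student's details are stored once, and each course taken adds only a small GRADE record. The course ID FIN300, the grades, and the new major are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE STUDENT (StudentID INTEGER PRIMARY KEY, StudentName TEXT, StudentMajor TEXT);
CREATE TABLE GRADE (StudentID INTEGER, CourseID TEXT, Grade TEXT,
                    PRIMARY KEY (StudentID, CourseID));
""")

# The student's name and major are entered exactly once.
conn.execute("INSERT INTO STUDENT VALUES (4567, 'Mary Brown', 'Finance')")
# Each course taken adds one GRADE record holding only the two IDs and the grade.
conn.executemany("INSERT INTO GRADE VALUES (?, ?, ?)",
                 [(4567, "MKT211", "A"), (4567, "FIN300", "B")])

# A change of major touches a single row, however many courses were taken.
updated = conn.execute(
    "UPDATE STUDENT SET StudentMajor = 'Marketing' WHERE StudentID = 4567").rowcount

# A join pulls the pieces back together when a combined view is needed.
rows = conn.execute("""
    SELECT s.StudentName, s.StudentMajor, g.CourseID
    FROM GRADE AS g JOIN STUDENT AS s ON s.StudentID = g.StudentID
    ORDER BY g.CourseID
""").fetchall()
print(updated, rows)
```

Both joined rows show the new major even though only one row was updated, which is exactly the redundancy problem the normalized design avoids.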

The design of the School database also makes it simple to change the design without major modifications to the existing structure. For example, if the design team were asked to add functionality to the system to track instructors who teach the courses, we could easily accomplish this by adding a PROFESSOR table (similar to the STUDENT table) and then adding a new field to the COURSE table to hold the professor's ID.

Data Types

When defining the fields in a database table, we must give each field a data type. For example, the field StudentName is a text string, while EnrollmentCapacity is a number. Most modern databases allow for several different data types to be stored. Some of the more common data types are listed here:

  • Text: for storing non-numeric data that is brief, generally under 256 characters. The database designer can identify the maximum length of the text.
  • Number: for storing numbers. There are usually a few different number types that can be selected, depending on how large the largest number will be.
  • Boolean: a data type with only two possible values, such as 0 or 1, "true" or "false", "yes" or "no".
  • Date/Time: a special form of the number data type that can be interpreted as a number or a time.
  • Currency: a special form of the number data type that formats all values with a currency indicator and two decimal places.
  • Paragraph Text: this data type allows for text longer than 256 characters.
  • Object: this data type allows for the storage of data that cannot be entered via keyboard, such as an image or a music file.

There are two important reasons that we must properly define the data type of a field. First, a data type tells the database what functions can be performed with the data. For example, if we wish to perform mathematical functions with one of the fields, we must be sure to tell the database that the field is a number data type. With number fields, we can, for instance, subtract the course capacity from the classroom capacity to find out the number of extra seats available.

The second important reason to define the data type is so that the proper amount of storage space is allocated for our data. For example, if the StudentName field is defined as a Text(50) data type, this means fifty characters are allocated for each name we want to store. If a student's name is longer than 50 characters, the database will truncate it.
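The extra-seats calculation mentioned above can be sketched as follows. The capacities, IDs, and course title are made-up values, and `ClassroomID` in the COURSE table is an assumed field. Note that SQLite, used here only for convenience, is loosely typed and does not enforce a declared text length, so the truncation behavior is not demonstrated.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CLASSROOM (ClassroomID INTEGER PRIMARY KEY, Capacity INTEGER)")
conn.execute("CREATE TABLE COURSE (CourseID INTEGER PRIMARY KEY, CourseTitle TEXT,"
             " EnrollmentCapacity INTEGER, ClassroomID INTEGER)")
conn.execute("INSERT INTO CLASSROOM VALUES (1, 40)")
conn.execute("INSERT INTO COURSE VALUES (10, 'Marketing Principles', 32, 1)")

# Both capacities are number fields, so the database can subtract one from the other.
extra_seats = conn.execute("""
    SELECT r.Capacity - c.EnrollmentCapacity
    FROM COURSE AS c JOIN CLASSROOM AS r ON r.ClassroomID = c.ClassroomID
    WHERE c.CourseID = 10
""").fetchone()[0]
print(extra_seats)
```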

Database Management Systems

Open Office Database Management System

To the computer, a database looks like one or more files. In order for the data in the database to be stored, read, changed, added, or removed, a software program must access it. Many software applications have this ability: iTunes can read its database to give you a listing of its songs (and play the songs); your mobile-phone software can interact with your list of contacts. But what about applications to create or manage a database? What software can you use to create a database, change a database's structure, or simply do analysis? That is the purpose of a category of software applications called database management systems (DBMS).

DBMS packages generally provide an interface to view and change the design of the database, create queries, and develop reports. Most of these packages are designed to work with a specific type of database, but generally are compatible with a wide range of databases.

A database that can only be used by a single user at a time is not going to meet the needs of most organizations. As computers have become networked and are now joined worldwide via the Internet, a class of database has emerged that can be accessed by two, ten, or even a million people. These databases are sometimes installed on a single computer to be accessed by a group of people at a single location. Other times, they are installed over several servers worldwide, meant to be accessed by millions. In enterprises, the relational DBMS products in common use include Oracle Database, Microsoft SQL Server, and IBM Db2. The open-source MySQL is also an enterprise database.

Microsoft Access and Open Office Base are examples of personal database-management systems. These systems are primarily used to develop and analyze single-user databases. These databases are not meant to be shared across a network or the Internet, but are instead installed on a particular device and work with a single user at a time. Apache OpenOffice.org Base (see screen shot) can be used to create, modify, and analyze databases in open-database (ODB) format. Microsoft's Access DBMS is used to work with databases in its own Microsoft Access Database format. Both Access and Base have the ability to read and write to other database formats as well.

Structured Query Language

Once you have a database designed and loaded with data, how will you do something useful with it? The primary way to work with a relational database is to use Structured Query Language, SQL (pronounced "sequel," or simply stated as S-Q-L). Almost all applications that work with databases (such as database management systems, discussed above) make use of SQL as a way to analyze and manipulate relational data. As its name implies, SQL is a language that can be used to work with a relational database. From a simple request for data to a complex update operation, SQL is a mainstay of programmers and database administrators. To give you a taste of what SQL might look like, here are a couple of examples using our School database:

The following query will retrieve the major of student John Smith from the STUDENT table:

SELECT StudentMajor  FROM STUDENT  WHERE StudentName = 'John Smith';

The following query will list the total number of students in the STUDENT table:

SELECT COUNT(*)  FROM STUDENT;
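To see the two queries run, they can be issued from a program. The sketch below uses Python's built-in `sqlite3` module with a small in-memory STUDENT table; the sample rows and John Smith's major are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE STUDENT (StudentID INTEGER PRIMARY KEY,"
             " StudentName TEXT, StudentMajor TEXT)")
conn.executemany("INSERT INTO STUDENT VALUES (?, ?, ?)", [
    (1, "John Smith", "Accounting"),  # hypothetical sample rows
    (2, "Mary Brown", "Finance"),
])

# First query: the major of student John Smith.
# The ? placeholder passes the name as a parameter instead of pasting it into the SQL text.
major = conn.execute(
    "SELECT StudentMajor FROM STUDENT WHERE StudentName = ?", ("John Smith",)
).fetchone()[0]

# Second query: the total number of students.
total = conn.execute("SELECT COUNT(*) FROM STUDENT").fetchone()[0]
print(major, total)
```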

SQL can be embedded in many computer languages that are used to develop platform-independent web-based applications. An in-depth description of how SQL works is beyond the scope of this introductory text, but these examples should give you an idea of the power of using SQL to manipulate relational databases. Many DBMS, such as Microsoft Access, allow you to use QBE (Query-by-Example), a graphical query tool, to retrieve data through visualized commands. QBE generates SQL for you, and is easy to use. In comparison with SQL, QBE has limited functionality and cannot work outside the DBMS environment.

Other Types of Databases

The relational database model is the most used database model today. However, many other database models exist that provide different strengths than the relational model. The hierarchical database model, popular in the 1960s and 1970s, connected data together in a hierarchy, allowing for a parent/child relationship between data. The document-centric model allowed for more unstructured data storage by placing data into "documents" that could then be manipulated.

Perhaps the most interesting new development is the concept of NoSQL (from the phrase "not only SQL"). NoSQL arose from the need to solve the problem of large-scale databases spread over several servers or even across the world. For a relational database to work properly, it is important that only one person be able to manipulate a piece of data at a time, a concept known as record-locking. But with today's large-scale databases (think Google and Amazon), this is just not possible. A NoSQL database can work with data in a looser way, allowing for a more unstructured environment, communicating changes to the data over time to all the servers that are part of the database.

As stated earlier, the relational database model does not scale well. The term scale here refers to a database getting larger and larger, being distributed on a larger number of computers connected via a network. Some companies are looking to provide large-scale database solutions by moving away from the relational model to other, more flexible models. For example, Google now offers the App Engine Datastore, which is based on NoSQL. Developers can use the App Engine Datastore to develop applications that access data from anywhere in the world. Amazon.com offers several database services for enterprise use, including Amazon RDS, which is a relational database service, and Amazon DynamoDB, a NoSQL enterprise solution.


Sidebar: What Is Metadata?

The term metadata can be understood as "data about data." Examples of metadata of a database are:

  • number of records
  • data type of field
  • size of field
  • description of field
  • default value of field
  • rules of use.

When a database is being designed, a "data dictionary" is created to hold the metadata, defining the fields and structure of the database.
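Most database systems let you query this metadata directly. As a minimal sketch, SQLite exposes field-level metadata through its `PRAGMA table_info` command; the STUDENT table definition and the default value 'Undeclared' below are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE STUDENT (StudentID INTEGER PRIMARY KEY,"
             " StudentName TEXT NOT NULL, StudentMajor TEXT DEFAULT 'Undeclared')")

# PRAGMA table_info returns one row per field:
# (cid, name, type, notnull, dflt_value, pk)
fields = {
    name: {"type": col_type, "required": bool(notnull), "default": default, "key": bool(pk)}
    for _cid, name, col_type, notnull, default, pk
    in conn.execute("PRAGMA table_info(STUDENT)")
}

# The number of records is metadata too.
record_count = conn.execute("SELECT COUNT(*) FROM STUDENT").fetchone()[0]

for name, meta in fields.items():
    print(name, meta)
print("records:", record_count)
```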


Finding Value in Data: Business Intelligence

With the rise of Big Data and a myriad of new tools and techniques at their disposal, businesses are learning how to use data to their advantage. The term business intelligence is used to describe the process that organizations use to take data they are collecting and analyze it in the hopes of obtaining a competitive advantage. Besides using their own data, stored in data warehouses (see below), firms often purchase data from data brokers to get a big-picture understanding of their industries and the economy. The results of these analyses can drive organizational strategies and provide competitive advantage.

Data Visualization

Data visualization is the graphical representation of information and data. These graphical representations (such as charts, graphs, and maps) can quickly summarize data in a way that is more intuitive and can lead to new insights and understandings. Just as a picture of a landscape can convey much more than a paragraph of text attempting to describe it, graphical representation of data can quickly make meaning of large amounts of data. Many times, visualizing data is the first step towards a deeper analysis and understanding of the data collected by an organization. Examples of data visualization software include Tableau and Google Data Studio.

Data Warehouses

As organizations have begun to utilize databases as the centerpiece of their operations, the need to fully understand and leverage the data they are collecting has become more and more apparent. However, directly analyzing the data that is needed for day-to-day operations is not a good idea; we do not want to tax the operations of the company more than we need to. Further, organizations also want to analyze data in a historical sense: How does the data we have today compare with the same set of data this time last month, or last year? From these needs arose the concept of the data warehouse.

The concept of the data warehouse is simple: extract data from one or more of the organization's databases and load it into the data warehouse (which is itself another database) for storage and analysis. However, the execution of this concept is not that simple. A data warehouse should be designed so that it meets the following criteria:

  • It uses non-operational data. This means that the data warehouse is using a copy of data from the active databases that the company uses in its day-to-day operations, so the data warehouse must pull data from the existing databases on a regular, scheduled basis.
  • The data is time-variant. This means that whenever data is loaded into the data warehouse, it receives a time stamp, which allows for comparisons between different time periods.
  • The data is standardized. Because the data in a data warehouse usually comes from several different sources, it is possible that the data does not use the same definitions or units. For example, each database may use its own format for dates (e.g., mm/dd/yy, or dd/mm/yy, or yy/mm/dd, etc.). In order for the data warehouse to match up dates, a standard date format would have to be agreed upon and all data loaded into the data warehouse would have to be converted to use this standard format. This process is called extraction-transformation-load (ETL).

There are two primary schools of thought when designing a data warehouse: bottom-up and top-down. The bottom-up approach starts by creating small data warehouses, called data marts, to solve specific business problems. As these data marts are created, they can be combined into a larger data warehouse. The top-down approach suggests that we should start by creating an enterprise-wide data warehouse and then, as specific business needs are identified, create smaller data marts from the data warehouse.

Data Warehouse Process (top-down)

Benefits of Data Warehouses

Organizations find data warehouses quite beneficial for a number of reasons:

  • The process of developing a data warehouse forces an organization to better understand the data that it is currently collecting and, equally important, what data is not being collected.
  • A data warehouse provides a centralized view of all data being collected across the enterprise and provides a means for determining data that is inconsistent.
  • Once all data is identified as consistent, an organization can generate "one version of the truth". This is important when the company wants to report consistent statistics about itself, such as revenue or number of employees.
  • By having a data warehouse, snapshots of data can be taken over time. This creates a historical record of data, which allows for an analysis of trends.
  • A data warehouse provides tools to combine data, which can provide new information and analysis.

Data Mining and Machine Learning

Data mining is the process of analyzing data to find previously unknown and interesting trends, patterns, and associations in order to make decisions. Generally, data mining is accomplished through automated means against extremely large data sets, such as a data warehouse. Some examples of data mining include:

  • An analysis of sales from a large grocery chain might determine that milk is purchased more frequently the day after it rains in cities with a population of less than 50,000.
  • A bank may find that loan applicants whose bank accounts show particular deposit and withdrawal patterns are not good credit risks.
  • A baseball team may find that collegiate baseball players with specific statistics in hitting, pitching, and fielding make for more successful major league players.

One data mining method that an organization can use to do these analyses is called machine learning. Machine learning is used to analyze data and build models without being explicitly programmed to do so. Two primary branches of machine learning exist: supervised learning and unsupervised learning.

Supervised learning occurs when an organization has data about past activity that has occurred and wants to replicate it. For example, if they want to create a new marketing campaign for a particular product line, they may look at data from past marketing campaigns to see which of their consumers responded most favorably. Once the analysis is done, a machine learning model is created that can be used to identify these new customers. It is called "supervised" learning because we are directing (supervising) the analysis towards a result (in our example: consumers who respond favorably). Supervised learning techniques include analyses such as decision trees, neural networks, classifiers, and logistic regression.

Unsupervised learning occurs when an organization has data and wants to understand the relationship(s) between different data points. For example, if a retailer wants to understand purchasing patterns of its customers, an unsupervised learning model can be developed to find out which products are most often purchased together or how to group their customers by purchase history. It is called "unsupervised" learning because no specific outcome is expected. Unsupervised learning techniques include clustering and association rules.
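The idea of finding products that are often purchased together can be illustrated with a very small sketch. It simply counts how often each pair of products appears in the same basket, which is the first step of association-rule analysis rather than a full algorithm; the baskets are made-up sample data.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase baskets, one set of products per transaction.
baskets = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
]

def pair_counts(transactions):
    """Count how many baskets contain each pair of products."""
    counts = Counter()
    for basket in transactions:
        # Sorting makes ('bread', 'milk') and ('milk', 'bread') the same pair.
        counts.update(combinations(sorted(basket), 2))
    return counts

counts = pair_counts(baskets)
top_pair, top_count = counts.most_common(1)[0]
print(top_pair, top_count)
```

No outcome was specified in advance; the pattern (bread and milk appear together in three of the four baskets) emerges from the data itself.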

Privacy Concerns

The increasing power of data mining has caused concerns for many, especially in the area of privacy. In today's digital world, it is becoming easier than ever to take data from disparate sources and combine them to do new forms of analysis. In fact, a whole industry has sprung up around this technology: data brokers. These firms combine publicly accessible data with information obtained from the government and other sources to create vast warehouses of data about people and companies that they can then sell. This subject will be covered in much more detail in Chapter 12, the chapter on the ethical concerns of information systems.


Sidebar: What is data science? What is data analytics?

The term "data science" is a popular term meant to describe the analysis of large data sets to find new knowledge. For the past several years, it has been considered one of the best career fields to get into due to its explosive growth and high salaries. While a data scientist does many different things, their focus is generally on analyzing large data sets using various programming methods and software tools to create new knowledge for their organization. Data scientists are skilled in machine learning and data visualization techniques. The field of data science is constantly changing, and data scientists are on the cutting edge of work in areas such as artificial intelligence and neural networks.


Knowledge Management

We end the chapter with a discussion on the concept of knowledge management (KM). All companies accumulate knowledge over the course of their existence. Some of this knowledge is written down or saved, but not in an organized fashion. Much of this knowledge is not written down; instead, it is stored inside the heads of the company's employees. Knowledge management is the process of formalizing how a company creates, captures, indexes, stores, and shares its knowledge in order to benefit from the experiences and insights that the company has captured during its existence.

Summary

In this chapter, we learned about the role that data and databases play in the context of information systems. Data is made up of facts of the world. If you process data in a particular context, then you have information. Knowledge is gained when information is consumed and used for decision making. A database is an organized collection of related data. Relational databases are the most widely used type of database, where data is structured into tables and all tables must be related to each other through unique identifiers. A database management system (DBMS) is a software application that is used to create and manage databases, and can take the form of a personal DBMS, used by one person, or an enterprise DBMS that can be used by multiple users. A data warehouse is a special form of database that takes data from other databases in an enterprise and organizes it for analysis. Data mining is the process of looking for patterns and relationships in large data sets. Many businesses use databases, data warehouses, and data-mining techniques in order to produce business intelligence and gain a competitive advantage.


Study Questions

  1. What is the difference between data, information, and knowledge?
  2. Explain in your own words how the data component relates to the hardware and software components of information systems.
  3. What is the difference between quantitative data and qualitative data? In what situations could the number 42 be considered qualitative data?
  4. What are the characteristics of a relational database?
  5. When would using a personal DBMS make sense?
  6. What is the difference between a spreadsheet and a database? List three differences between them.
  7. Describe what the term normalization means.
  8. Why is it important to define the data type of a field when designing a relational database?
  9. Name a database you interact with frequently. What would some of the field names be?
  10. What is metadata?
  11. Name three advantages of using a data warehouse.
  12. What is data mining?
  13. In your own words, explain the difference between supervised learning and unsupervised learning. Give an example of each (not from the book).

Exercises

  1. Review the design of the School database earlier in this chapter. Reviewing the lists of data types given, what data types would you assign to each of the fields in each of the tables? What lengths would you assign to the text fields?
  2. Download Apache OpenOffice.org and use the database tool to open the "Student Clubs.odb" file available here. Take some time to learn how to modify the database structure and then see if you can add the required items to support the tracking of faculty advisors, as described at the end of the Normalization section in the chapter. Here is a link to the Getting Started documentation.
  3. Using Microsoft Access, download the database file of comprehensive baseball statistics from the website SeanLahman.com. (If you don't have Microsoft Access, you can download an abridged version of the file here that is compatible with Apache OpenOffice.) Review the structure of the tables included in the database. Come up with three different data-mining experiments you would like to try, and explain which fields in which tables would have to be analyzed.
  4. Do some original research and find two examples of data mining. Summarize each example and then write about what the two examples have in common.
  5. Conduct some independent research on the process of business intelligence. Using at least two scholarly or practitioner sources, write a two-page paper giving examples of how business intelligence is being used.
  6. Conduct some independent research on the latest technologies being used for knowledge management. Using at least two scholarly or practitioner sources, write a two-page paper giving examples of software applications or new technologies being used in this field.

Source: https://opentextbook.site/informationsystems2019/chapter/chapter-4-data-and-databases-update/
