For AI to work, it needs data, and lots of it, but it also needs quality data. Generating data is not a problem; by one widely cited estimate, the world produces approximately 2.5 quintillion bytes every day. It’s the quality of the data that AI professionals struggle with. Their algorithms cannot deliver quality insights if the data is flawed.
Much of today’s technology depends on data. Whether it is analytics or artificial intelligence, the modern enterprise needs data to survive. But data is more than numbers in a spreadsheet or fields in a database. Today, we discuss how to develop a strategy that ensures data quality from start to finish, beginning with how data is stored.
How is Data Stored?
Data is typically stored in one of the following formats:
- Structured. Structured data uses tabular or columnar formats as in databases and spreadsheets. Programs can quickly find the data based on its location.
- Unstructured. Computers struggle to find specific information in unstructured data. Text files, emails, presentations, and videos can store data anywhere, making it difficult for computers to quickly locate a particular data point.
- Metadata. Web pages are examples of data sources that use metadata. The metadata contains information on what the web page contains and is used by search engines to identify topic and possible file locations.
- Semi-structured. Some data formats such as XML and JSON use self-describing structures to help with data location.
Only a small percentage of data is structured — about 10%. Most comes in the form of text files, emails, and documents. Unfortunately, AI needs structured data for maximum performance, which requires converting terabytes of unstructured data into high-quality structured data. Depending on the types and quality of data, companies can spend months cleaning, preparing, and integrating data before a single algorithm is used. For organizations eager to implement an AI solution, the time spent improving the data can seem unwarranted.
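As a rough illustration of that conversion step, the sketch below flattens one semi-structured JSON record into a single flat row that could be loaded into a table. The record contents and the `flatten` helper are hypothetical, invented only for this example; real conversions of text files or emails are far more involved.

```python
import json


def flatten(record, prefix=""):
    """Flatten nested dictionaries into one level, using dotted column names."""
    row = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested objects and prefix their keys.
            row.update(flatten(value, prefix=f"{name}."))
        else:
            row[name] = value
    return row


# Hypothetical semi-structured input.
raw = '{"id": 7, "customer": {"name": "Ada", "state": "NY"}, "total": 19.5}'
row = flatten(json.loads(raw))
print(row)
```

Each key in `row` can then serve as a column name, which is what lets a program find a value by its location.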
What Makes Quality Data?
Not all data is created equal; however, any data used should meet a minimum quality standard to ensure a successful AI implementation. The following criteria are used to determine quality:
- Accuracy. 100% accuracy is not always feasible. Establishing an accuracy standard helps ensure consistency across large datasets.
- Completeness. Simply stated, the more complete the data, the better the outcome.
- Consistency. Data needs to appear in the same format regardless of its source. For example, state names should be abbreviated or spelled out, but not both.
- Traceability. Data integrity is based on the ability to trace data back to its original source. If questions arise, the inputs should be tied to the original source.
- Uniformity. Mathematical values and measurements should use a uniform type, or calculations may fail.
- Validity. Valid data is data that is relevant to the specific AI implementation. Extraneous data slows the learning process.
The timeliness of the data should be determined at the start of the project. If the project is looking for future trends, older data can skew the results. Deciding the time range for data ensures that all data is from the same period, regardless of the source. When looking at data quality, the programmer’s adage of garbage in, garbage out applies. Successful AI implementations require high-quality data.
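A few of these criteria can be checked mechanically. The following is a minimal sketch, assuming a hypothetical record layout (`name`, `state`, `updated`) and a deliberately tiny state lookup table; it measures completeness, enforces one state format for consistency, and filters records to an agreed time range.

```python
from datetime import date

REQUIRED_FIELDS = ("name", "state", "updated")           # hypothetical schema
STATE_ABBREVIATIONS = {"New York": "NY", "Texas": "TX"}  # illustrative subset only


def completeness(records):
    """Return the fraction of required fields that are present and non-empty."""
    total = len(records) * len(REQUIRED_FIELDS)
    filled = sum(
        1
        for rec in records
        for field in REQUIRED_FIELDS
        if rec.get(field) not in (None, "")
    )
    return filled / total if total else 1.0


def normalize_state(value):
    """Consistency: always use the abbreviation, whatever the source used."""
    return STATE_ABBREVIATIONS.get(value, value)


def within_range(records, start, end):
    """Timeliness: keep only records updated inside the agreed period."""
    return [
        rec for rec in records
        if rec.get("updated") is not None and start <= rec["updated"] <= end
    ]


records = [
    {"name": "Ada", "state": "New York", "updated": date(2023, 5, 1)},
    {"name": "Bob", "state": "TX", "updated": date(2019, 1, 15)},
    {"name": "", "state": "Texas", "updated": date(2023, 8, 9)},
]
recent = within_range(records, date(2023, 1, 1), date(2023, 12, 31))
```

The acceptable completeness score and the time range are project decisions; the code only makes them measurable.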
Why Clean Data?
Most people have experienced the confusion that comes when duplicate profiles exist on the same system. Half of the information is in one profile, and the rest may be spread over one or maybe two additional records. This duplication is frustrating for humans, but it’s disastrous for AI implementations. That’s why eliminating the following kinds of data is an essential first step in delivering quality data.
- Conflicting. Information may conflict. For example, the number of red cars sold in January is 25 in one database and 27 in another. That discrepancy may seem minor when talking about cars, but a two-point deviation can have catastrophic results when dispensing medication.
- Corrupt. Depending on how corrupted the data is, it may be possible to recover parts of it. Whether to attempt recovery depends on the value of the data to the AI algorithms being used.
- Invalid. Sanity checks should be performed on data fields. For example, when an eight-digit social security number appears, it should be flagged as invalid. Social security numbers are nine digits.
- Incomplete. Ignoring missing data can change results. If the missing data cannot be added, adjustments may be required in the AI algorithm to account for missing information.
If an AI process encounters two records for the same individual, how does it decide which is correct? It’s possible that information needs to be pulled from both records to produce a valid result. Part of the cleaning process is determining what to do when data discrepancies appear. Given the volume of data involved, it’s important to have uniform cleaning criteria in place.
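One possible set of uniform cleaning criteria is sketched below. It assumes records are keyed by a hypothetical `ssn` field, applies only the nine-digit sanity check mentioned above (real validation rules are stricter), fills gaps from duplicate records, and reports conflicts instead of silently picking a winner. Keeping the first value seen on a conflict is an arbitrary choice made for this example.

```python
import re

NINE_DIGITS = re.compile(r"\d{9}")


def is_valid_ssn(value):
    """Sanity check only: nine digits once spaces and hyphens are removed."""
    digits = re.sub(r"[\s-]", "", value)
    return NINE_DIGITS.fullmatch(digits) is not None


def merge_records(records, key="ssn"):
    """Merge records that share a key, filling gaps and collecting conflicts."""
    merged, conflicts = {}, []
    for rec in records:
        target = merged.setdefault(rec[key], {})
        for field, value in rec.items():
            if value in (None, ""):
                continue  # incomplete field: nothing to contribute
            if field in target and target[field] != value:
                # Conflicting values: keep the first one, flag it for review.
                conflicts.append((rec[key], field, target[field], value))
            else:
                target[field] = value
    return list(merged.values()), conflicts


duplicates = [
    {"ssn": "123456789", "name": "Ada Lovelace", "email": ""},
    {"ssn": "123456789", "name": "Ada Lovelace", "email": "ada@example.com"},
    {"ssn": "123456789", "name": "A. Lovelace", "email": "ada@example.com"},
]
clean, conflicts = merge_records(duplicates)
```

Here the three duplicate profiles collapse into one record with the email filled in, and the differing name is returned in `conflicts` so that a person or a documented rule can decide which value is correct.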
What is Preparing Data?
Prepping data for AI use means converting or transforming it into usable formats. During this process, data is reduced to the essential items needed by the AI algorithms. Prepping and cleaning data can occur in any order. Sometimes, data goes through an initial cleaning, is prepped, and undergoes additional cleaning. The process depends on the condition of the data at the start of the project.
Preparing data for AI involves the following:
- Unused Data. Removing data that is not used by AI algorithms reduces the drain on system resources. It makes for more efficient data usage and reduces data storage requirements.
- Time Stamps. Date and time fields are the bane of any data scientist’s existence. Dates can be MM/DD/YYYY or DD/MM/YYYY. Older systems may use two-digit years or a Julian date.
- Data Types. Reclassifying data types can also improve efficiencies. Changing a numerical field to an integer can simplify calculations.
- Format. Removing formatting such as line breaks, tabs, and carriage returns makes it easier for AI to ingest data.
Data uniformity makes AI more efficient and effective. Removing unnecessary data minimizes the volume of information that needs processing. Something as basic as changing data types and removing formatting can significantly improve performance.
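The preparation steps above can be sketched with a few small helpers. The field names and sample row are hypothetical, and the date format is assumed to be known for each source system, since a string like 03/04/2021 cannot be interpreted reliably without it.

```python
from datetime import datetime


def parse_date(text, fmt):
    """Parse a date using the format its source system is known to use."""
    return datetime.strptime(text, fmt).date()


def to_int(text):
    """Convert a numeric string to an integer, refusing to drop a fraction."""
    value = float(text)
    if not value.is_integer():
        raise ValueError(f"not a whole number: {text!r}")
    return int(value)


def clean_text(text):
    """Replace line breaks, tabs, and carriage returns with single spaces."""
    return " ".join(text.split())


def keep_fields(row, wanted):
    """Drop fields the AI algorithms do not use."""
    return {name: row[name] for name in wanted if name in row}


# Hypothetical raw row from a US-format source system.
raw_row = {"sold": "03/04/2021", "units": "25.0", "note": "red\tcar\r\nsold", "fax": "n/a"}
prepared = keep_fields(raw_row, ("sold", "units", "note"))
prepared["sold"] = parse_date(prepared["sold"], "%m/%d/%Y")
prepared["units"] = to_int(prepared["units"])
prepared["note"] = clean_text(prepared["note"])
```

The same input string parsed with `"%d/%m/%Y"` would yield April 3 instead of March 4, which is why the format has to travel with the data source.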
Why Integrate Data?
Integration may be one of the last steps in delivering quality data, but it is no less important than the others. Integrating data means creating a data pipeline that extracts information from multiple sources and brings it together for effective and actionable intelligence. Like the other steps, integration is not without challenges.
- Legacy systems. Older systems may not time stamp activities, and some information may be cryptic to minimize storage space. They may run slower and lack the APIs to interface with newer systems.
- Updated data. The data pipeline must operate 24/7 to ensure that AI solutions are using the latest information. Determining how to automate a pipeline with legacy and advanced technologies can challenge anyone’s ability to integrate enterprise-wide systems.
- Advanced technologies. With more devices such as sensors and other IoT hardware being deployed, the volume of available data increases. These devices do not store data in the same formats as legacy systems, although they are more likely to have API support.
Successful integration depends on data sources and controls. The more control an organization has over its data, the stronger the base for AI training. Achieving that control takes time and planning to ensure limited downtime and optimum performance.
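A very small pipeline sketch shows the idea of bringing two differently shaped sources into one schema. Both source formats here are invented for illustration: a pipe-delimited legacy export and an already-parsed JSON payload from a newer API. Each output row records its source so that it stays traceable.

```python
def from_legacy(line):
    """Hypothetical legacy export: store id, region, units, pipe-delimited."""
    store_id, region, units = line.split("|")
    return {
        "store_id": int(store_id),
        "region": region.strip().upper(),
        "units_sold": int(units),
        "source": "legacy",
    }


def from_api(payload):
    """Hypothetical API payload, already parsed from JSON into a dict."""
    return {
        "store_id": int(payload["storeId"]),
        "region": payload["region"].strip().upper(),
        "units_sold": int(payload["unitsSold"]),
        "source": "api",
    }


def run_pipeline(legacy_lines, api_payloads):
    """Extract from both sources and return rows in one common schema."""
    rows = [from_legacy(line) for line in legacy_lines if line.strip()]
    rows.extend(from_api(payload) for payload in api_payloads)
    return rows


rows = run_pipeline(
    ["1001|ny|25", ""],
    [{"storeId": 1002, "region": "TX", "unitsSold": 27}],
)
```

A production pipeline would add scheduling, error handling, and monitoring; the point here is only that each source gets its own adapter while downstream code sees a single format.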
Why Data is Important
Without data, AI doesn’t exist. Data allows AI algorithms to learn. If data is not clean, less information is available, leaving AI with reduced datasets for training. Unnecessary resources may be expended at runtime if data is not properly prepared. A lack of quality data means either slowing the AI process to improve data quality or delivering faulty results. And the need for data doesn’t stop after AI training is complete; updated information needs to be added continuously for the most accurate insights.
F33.ai helps organizations build more efficient and more effective AI implementations through comprehensive data sourcing. Our team drives the development of customer solutions that continuously expand the limits of what AI can do, making it possible to deliver more impactful business results. To learn more about the importance of data in making AI and ML work in your organization, contact us to set up a discussion.