Difference between revisions of "Validate data"
(52 intermediate revisions by the same user not shown) | |||
Line 1: | Line 1: | ||
− | [[Category: | + | [[Category:How_To]] |
− | + | [[Category:How_To V11]] | |
=Overview= | =Overview= | ||
− | To Validate | + | To Validate data you need to have the following in place. |
− | |||
− | |||
− | |||
==Data Provider== | ==Data Provider== | ||
− | A Data Provider is an Organisation Type | + | A [[Organisations_-_Data_Providers|Data Provider]] is an Organisation Type. When a [[Provision_Agreements|Provision Agreement]] is created a [[Dataflows|Dataflow]] and a Data Provider must be selected. An example Data Provider is shown below. |
− | [[File:val1.PNG|500px]]<br> | + | [[File:val1.PNG|Data Provider - Step 1|500px]]<br> |
==Provision Agreement== | ==Provision Agreement== | ||
− | A Provision Agreement is the union of a Dataflow with a Data Provider. A | + | A Provision Agreement (PA)is the union of a Dataflow with a Data Provider. A PA is a definition that the Data Provider can provide data for the Dataflow. Data is always reported by a Data Provider against the PA. [https://fmrwiki.sdmxcloud.org/Provision_Agreements You can read more about Provision Agreements in this article]. An example Provision Agreement is shown below. |
− | |||
− | |||
+ | [[File:val2.PNG|Creating a Provision Agreement|850px]]<br> | ||
==Dataflow== | ==Dataflow== | ||
A Dataflow is a structure on which data is collected and disseminated. A Dataflow references a Data Structure Definition (DSD) which is used as the underlying template to which the data must conform. | A Dataflow is a structure on which data is collected and disseminated. A Dataflow references a Data Structure Definition (DSD) which is used as the underlying template to which the data must conform. | ||
− | [https:// | + | [https://wiki.sdmxcloud.org/Dataflow_-_Structural_Metadata_Management You can read more about Dataflows in this article]. An example Dataflow is shown below. |
+ | |||
+ | |||
+ | [[File:val3.PNG|Dataflow - Step 1|500px]]<br> | ||
+ | |||
+ | [[File:val4.PNG|Dataflow - Step 2|850px]]<br> | ||
+ | |||
+ | =Load Data= | ||
+ | |||
+ | Once all the elements are in place as described above, the next step is to load the data which is done via the Convert option on the Data Menu. | ||
+ | |||
+ | Data can be loaded from a file of via a URL (for example from the SDMX Global Registry site). | ||
+ | |||
+ | To successfully validate, the data must adhere to the SDMX standard in terms of format as well as what has been defined in the Data Structure. | ||
+ | |||
+ | [https://fmrwiki.sdmxcloud.org/Data_Structure_Definitions You can read more about Data Structures in this article]. | ||
+ | |||
+ | [https://fmrwiki.sdmxcloud.org/Create_a_simple_DSD You can read more about how to create a simple Data Structures in this article]. | ||
+ | |||
+ | '''Supported formats:''' | ||
+ | * SDMX-ML (3.0, 2.1 , 2.0 and 1.0) | ||
+ | * EDI | ||
+ | * SDMX-JSON | ||
+ | * SDMX-CSV | ||
+ | * XLS / XLSX | ||
+ | |||
+ | '''Unsupported formats:''' | ||
+ | * Any other CSV format (Fusion CSV, Time Series CSV) | ||
+ | * RDF, Turtle, etc. | ||
+ | |||
+ | [[File:val5.PNG|Convert Data Page|850px]]<br> | ||
+ | |||
+ | To see this process in action you can watch this video. | ||
+ | |||
+ | |||
+ | [[File:LOADINGDATA.mp4]]<br> | ||
+ | |||
+ | ==Validate Data== | ||
+ | |||
+ | Click Load Data to start the validation process as explained in the image below. | ||
+ | |||
+ | [[File:val6.PNG|Dataset Details Page|850px]]<br> | ||
+ | |||
+ | =Validation Scheme= | ||
+ | ==What is a Validation Scheme?== | ||
+ | Validation Schemes define one or more validation rules which can be executed against a Dataflow at the data validation stage of a data load. Each validation rule consists of a mathematical expression or a link to an aggregation hierarchy which is used to create an expression. This validation goes beyond syntactical and semantical validation of the dataset and is instead checking that the values supplied in the dataset conform to specific business rules. Examples of this could be that a particular field must have a value less than 100, or that the total value reported must be the same as the total of a set of other observation values. | ||
+ | |||
+ | |||
+ | [[File:Valid1.PNG|Validation Scheme Page|850px]]<br> | ||
+ | |||
+ | A Validation Scheme must be assigned against a single Dataflow and may consist of one or many validation rules. A single Validation Rule consists of: | ||
+ | |||
+ | * an ID and name | ||
+ | * an optional description | ||
+ | * a type: either a custom expression or a hierarchic expression | ||
+ | * a result type (either numerical or code type) and a value (e.g. 100 or GDP) | ||
+ | * an equality operator (one of the following mathematical operators: =, <>, <, <=, >, >= ) | ||
+ | * an expression (e.g. [EUR]+[FR] ) | ||
+ | |||
+ | The Validation Scheme rules will be applicable to all datasets submitted against the Dataflow the Validation Scheme is linked to. | ||
+ | |||
+ | ==How are Rules Applied== | ||
+ | |||
+ | A validation rule operates on a single dimension, an example of a rule to calculate Total from the inputs Males and Females would look like the following: | ||
+ | |||
+ | [T] = [M] + [F] | ||
+ | |||
+ | '''Note''': the syntax used in a validation scheme puts code Ids into square brackets. | ||
+ | |||
+ | This rule would be applied to every series where all other parts of the series key match, so the following series there would be two matches to this rule, one for employment, and one for unemployment. | ||
+ | |||
+ | |||
+ | [[File:Valid2.PNG|Rule Application 1|850px]]<br> | ||
+ | |||
+ | For a validation rule to be executed there must be data reported for the output, and at least one of the inputs. If data are missing in the inputs, then they are treated a zero value. In the following example, only 1 rule is matched, and there is only one input (Male). | ||
+ | |||
+ | |||
+ | [[File:Valid3.PNG|Rule Application 2|850px]]<br> | ||
+ | |||
+ | There are two types of validation rules, ones which use a custom written expression, as described above. The second type references a Hierarchy in the Registry, and the Hierarchy is used as the basis for an Aggregation expression. For example, the following image shows a hierarchy of countries, against theoretical reported values. This is an example of a hierarchy being used to validate a dataset. | ||
+ | A hierarchy can be applied to any dimension that uses the same Codelist as the Hierarchy. When values are read in the data file, the totals at each sub-hierarchy are summed up to ensure they are consistent with the parent value. If any values are missing data, they are treated as having a value of zero. | ||
+ | |||
+ | |||
+ | [[File:Valid4.PNG|Rule Diagram|thumb|850px]]<br> | ||
+ | |||
+ | '''Note''': the Registry only checks data in the submitted file, and does not cross check against any persisted data when validating. For example, if you have already stored the totals in a Registry database, submitting a Dataset containing the values making up the totals, the Registry will not validate from the file against the totals already stored. | ||
+ | |||
+ | =Validation Scheme Tutorial= | ||
+ | This tutorial describes the manual steps in the process to create a Validation Scheme. It is required that your Registry be populated with structures that support this process (such as Data Structure Definitions and Dataflows). | ||
+ | |||
+ | ==Overview== | ||
+ | This section will explain the creation of a simple Validation Scheme that demonstrates when Data is loaded into the Registry that the rules within the scheme will be used for validation purposes. | ||
+ | |||
+ | ==Creating the Validation Scheme from the User Interface== | ||
+ | A Validation Scheme is created or maintained by using the authoring Wizard. The cogs icon used to open the wizard, is only available to authenticated Agency or Admin users. | ||
+ | |||
+ | |||
+ | [[File:Create Validation Scheme.PNG|1000px]]<br> | ||
+ | |||
+ | There are four steps to creating a Validation Scheme and are described in the following sections: | ||
+ | |||
+ | ===Step 1: Details=== | ||
+ | The first step allows for the specification of the ID and other associated high-level values which make up the definition of the Validation Scheme. | ||
+ | |||
+ | |||
+ | [[File:Valid6.PNG|Validation Scheme - Step 1|600px]]<br> | ||
+ | |||
+ | Like all other SDMX Structures, the Id, Version, Name and Agency are mandatory, and the other values are optional. Enter these values and click on the '''Next''' button. | ||
+ | |||
+ | ===Step 2: Attachment=== | ||
+ | This step assigns the Validation Scheme to a single Dataflow. | ||
+ | |||
+ | |||
+ | [[File:Valid7.PNG|Validation Scheme - Step 2|600px]]<br> | ||
+ | |||
+ | Click on the Dataflow field which will show all Dataflows in the Registry. Select the Dataflow to attach this Scheme to and its details will be displayed in the lower half of this page. | ||
+ | |||
+ | Click on the '''Next''' button once the correct Dataflow has been selected. | ||
+ | |||
+ | ===Step 3: CSV Import=== | ||
+ | This step is optional and allows for the creation of several validation rules via text entry. | ||
+ | |||
+ | |||
+ | [[File:Valid8.PNG|Validation Scheme - Step 3|600px]]<br> | ||
+ | |||
+ | To create a rule, enter the rule details on its own line in the text area. Elements of the rule must be separated by commas. Each element must conform to the expected value of that field (e.g., the ID element must not contain illegal characters; the equality operator may only be one of the permitted values). The field order is determined by the drop-downs at the top of this step and can be changed by the user. | ||
+ | |||
+ | Syntax of an Expression: | ||
+ | |||
+ | Each expression operates on a single dimension, and is made up of three parts: | ||
+ | |||
+ | # Resultant<br> | ||
+ | # | ||
+ | # Equality Operator<br> | ||
+ | # | ||
+ | # Operands and Operators<br> | ||
+ | |||
+ | The resultant and operands can either be numerical constants (e.g., 100) or the Id of a code (e.g., EUROPE). If a code id is used, the syntax demands it is placed in square brackets [EUROPE]. | ||
+ | |||
+ | For example, an expression to validate the reported value for EUROPE may look like the following: | ||
+ | |||
+ | ''[EUROPE]=([UK]+[FR]+[DE]) - (([ES]+[IT]+[LUX]) * 2)'' | ||
+ | |||
+ | When importing this via CSV the output, equality operator, and expression are all supplied in their own CSV field. So, a full example of a CSV rule would look like the following: | ||
+ | |||
+ | ''SERIES_CHECK,Europe Total,REF_AREA,[EUROPE],=,"([UK]+[FR]+[DE]) - (([ES]+[IT]+[LUX]) * 2)"'' | ||
+ | |||
+ | This would create the rule with the following characteristics: | ||
+ | * Rule Id: SERIES_CHECK | ||
+ | * Rule Name: Europe Total | ||
+ | * Rule Description: <none> (no column index supplied) | ||
+ | * Dimension Id: REF_AREA | ||
+ | * Mathematical expression: EUROPE = (UK+FR+DE) - ((ES+IT+LUX) * 2) | ||
+ | |||
+ | ===Step 4: Expression Builder=== | ||
+ | Rather than use the CSV import (step 3) to create a validation rule, it may be easier to create a validation rule through the user interface on step 4. | ||
+ | |||
+ | |||
+ | [[File:Valid9.PNG|Validation Scheme - Step 4|800px]]<br> | ||
+ | |||
+ | The left-hand side of the page shows all dimensions for this Dataflow followed by a number. The number indicates the number of Validation Rules against that dimension. The right-hand side shows the rules that have been defined for the currently selected Dimension. Selecting a rule shows the expression for that rule (i.e., the logic of the rule itself) or if this is a hierarchical rule then the aggregation hierarchy is displayed. | ||
+ | |||
+ | Editing or creating a rule displays a dialog where the values for the rule can be defined. This is similar to step 3 of the Validation Scheme Wizard. | ||
+ | |||
+ | |||
+ | [[File:Valid10.PNG|Validation Scheme - Create Rule|800px]]<br> | ||
+ | |||
+ | As stated previously a rule must have an Id and Name and may have an optional description. The control at top-right allows the user to distinguish between a Custom Expression and a Hierarchical Expression (this choice is only available for Dimensions with a Concept that represents a Hierarchic codelist). | ||
+ | |||
+ | The Mathematical Expression can be defined as either a Code evaluated against an expression (e.g., EUROPE=FR+DE) or a value evaluated against an expression (e.g., 55 >= GE). The expression area may be typed into manually and the user will need to do this in order to add mathematical symbols and numbers as appropriate. However the Quick Code Lookup, situated at the bottom of this page, lists all of the codes for this Dimension. Clicking an item in this list will insert it into the expression area. | ||
+ | |||
+ | For Hierarchic Rules, these can only be created when the Codelist referenced by the Dimension is also referenced by a Hierarchy – this permits the Rule Type option of Aggregate Using Hierarchy to be selected. If no hierarchies exist which use the same Codelist as the Dimension, the Rule Type option will be fixed on Custom Rule. | ||
+ | |||
+ | When the rule type is set to Aggregate using Hierarchy the result and expression section of the dialog is replaced with a list of available Hierarchies that can be selected (under the heading “Hierarchy to Use”). A Hierarchy must be selected, and the view of the selected hierarchy is shown to the right of the hierarchical list. | ||
+ | [[File:Valid11.PNG|Validation Rules|800px]]<br> | ||
+ | Once the rule has been defined, clicking “Add Rule” will dismiss the dialog and add this rule to the selected dimension. | ||
− | = | + | Note that during creation, rules are not validated for logic or contradictory values. So, it is possible to create nonsensical rules such as EUROPE >= EUROPE. It is down to the user to ensure that the Validation Rules have a logical sense. |
− |
Latest revision as of 03:27, 13 May 2024
Overview
To Validate data you need to have the following in place.
Data Provider
A Data Provider is an Organisation Type. When a Provision Agreement is created a Dataflow and a Data Provider must be selected. An example Data Provider is shown below.
Provision Agreement
A Provision Agreement (PA)is the union of a Dataflow with a Data Provider. A PA is a definition that the Data Provider can provide data for the Dataflow. Data is always reported by a Data Provider against the PA. You can read more about Provision Agreements in this article. An example Provision Agreement is shown below.
Dataflow
A Dataflow is a structure on which data is collected and disseminated. A Dataflow references a Data Structure Definition (DSD) which is used as the underlying template to which the data must conform. You can read more about Dataflows in this article. An example Dataflow is shown below.
Load Data
Once all the elements are in place as described above, the next step is to load the data which is done via the Convert option on the Data Menu.
Data can be loaded from a file of via a URL (for example from the SDMX Global Registry site).
To successfully validate, the data must adhere to the SDMX standard in terms of format as well as what has been defined in the Data Structure.
You can read more about Data Structures in this article.
You can read more about how to create a simple Data Structures in this article.
Supported formats:
- SDMX-ML (3.0, 2.1 , 2.0 and 1.0)
- EDI
- SDMX-JSON
- SDMX-CSV
- XLS / XLSX
Unsupported formats:
- Any other CSV format (Fusion CSV, Time Series CSV)
- RDF, Turtle, etc.
To see this process in action you can watch this video.
Validate Data
Click Load Data to start the validation process as explained in the image below.
Validation Scheme
What is a Validation Scheme?
Validation Schemes define one or more validation rules which can be executed against a Dataflow at the data validation stage of a data load. Each validation rule consists of a mathematical expression or a link to an aggregation hierarchy which is used to create an expression. This validation goes beyond syntactical and semantical validation of the dataset and is instead checking that the values supplied in the dataset conform to specific business rules. Examples of this could be that a particular field must have a value less than 100, or that the total value reported must be the same as the total of a set of other observation values.
A Validation Scheme must be assigned against a single Dataflow and may consist of one or many validation rules. A single Validation Rule consists of:
- an ID and name
- an optional description
- a type: either a custom expression or a hierarchic expression
- a result type (either numerical or code type) and a value (e.g. 100 or GDP)
- an equality operator (one of the following mathematical operators: =, <>, <, <=, >, >= )
- an expression (e.g. [EUR]+[FR] )
The Validation Scheme rules will be applicable to all datasets submitted against the Dataflow the Validation Scheme is linked to.
How are Rules Applied
A validation rule operates on a single dimension, an example of a rule to calculate Total from the inputs Males and Females would look like the following:
[T] = [M] + [F]
Note: the syntax used in a validation scheme puts code Ids into square brackets.
This rule would be applied to every series where all other parts of the series key match, so the following series there would be two matches to this rule, one for employment, and one for unemployment.
For a validation rule to be executed there must be data reported for the output, and at least one of the inputs. If data are missing in the inputs, then they are treated a zero value. In the following example, only 1 rule is matched, and there is only one input (Male).
There are two types of validation rules, ones which use a custom written expression, as described above. The second type references a Hierarchy in the Registry, and the Hierarchy is used as the basis for an Aggregation expression. For example, the following image shows a hierarchy of countries, against theoretical reported values. This is an example of a hierarchy being used to validate a dataset. A hierarchy can be applied to any dimension that uses the same Codelist as the Hierarchy. When values are read in the data file, the totals at each sub-hierarchy are summed up to ensure they are consistent with the parent value. If any values are missing data, they are treated as having a value of zero.
Note: the Registry only checks data in the submitted file, and does not cross check against any persisted data when validating. For example, if you have already stored the totals in a Registry database, submitting a Dataset containing the values making up the totals, the Registry will not validate from the file against the totals already stored.
Validation Scheme Tutorial
This tutorial describes the manual steps in the process to create a Validation Scheme. It is required that your Registry be populated with structures that support this process (such as Data Structure Definitions and Dataflows).
Overview
This section will explain the creation of a simple Validation Scheme that demonstrates when Data is loaded into the Registry that the rules within the scheme will be used for validation purposes.
Creating the Validation Scheme from the User Interface
A Validation Scheme is created or maintained by using the authoring Wizard. The cogs icon used to open the wizard, is only available to authenticated Agency or Admin users.
There are four steps to creating a Validation Scheme and are described in the following sections:
Step 1: Details
The first step allows for the specification of the ID and other associated high-level values which make up the definition of the Validation Scheme.
Like all other SDMX Structures, the Id, Version, Name and Agency are mandatory, and the other values are optional. Enter these values and click on the Next button.
Step 2: Attachment
This step assigns the Validation Scheme to a single Dataflow.
Click on the Dataflow field which will show all Dataflows in the Registry. Select the Dataflow to attach this Scheme to and its details will be displayed in the lower half of this page.
Click on the Next button once the correct Dataflow has been selected.
Step 3: CSV Import
This step is optional and allows for the creation of several validation rules via text entry.
To create a rule, enter the rule details on its own line in the text area. Elements of the rule must be separated by commas. Each element must conform to the expected value of that field (e.g., the ID element must not contain illegal characters; the equality operator may only be one of the permitted values). The field order is determined by the drop-downs at the top of this step and can be changed by the user.
Syntax of an Expression:
Each expression operates on a single dimension, and is made up of three parts:
- Resultant
- Equality Operator
- Operands and Operators
The resultant and operands can either be numerical constants (e.g., 100) or the Id of a code (e.g., EUROPE). If a code id is used, the syntax demands it is placed in square brackets [EUROPE].
For example, an expression to validate the reported value for EUROPE may look like the following:
[EUROPE]=([UK]+[FR]+[DE]) - (([ES]+[IT]+[LUX]) * 2)
When importing this via CSV the output, equality operator, and expression are all supplied in their own CSV field. So, a full example of a CSV rule would look like the following:
SERIES_CHECK,Europe Total,REF_AREA,[EUROPE],=,"([UK]+[FR]+[DE]) - (([ES]+[IT]+[LUX]) * 2)"
This would create the rule with the following characteristics:
- Rule Id: SERIES_CHECK
- Rule Name: Europe Total
- Rule Description: <none> (no column index supplied)
- Dimension Id: REF_AREA
- Mathematical expression: EUROPE = (UK+FR+DE) - ((ES+IT+LUX) * 2)
Step 4: Expression Builder
Rather than use the CSV import (step 3) to create a validation rule, it may be easier to create a validation rule through the user interface on step 4.
The left-hand side of the page shows all dimensions for this Dataflow followed by a number. The number indicates the number of Validation Rules against that dimension. The right-hand side shows the rules that have been defined for the currently selected Dimension. Selecting a rule shows the expression for that rule (i.e., the logic of the rule itself) or if this is a hierarchical rule then the aggregation hierarchy is displayed.
Editing or creating a rule displays a dialog where the values for the rule can be defined. This is similar to step 3 of the Validation Scheme Wizard.
As stated previously a rule must have an Id and Name and may have an optional description. The control at top-right allows the user to distinguish between a Custom Expression and a Hierarchical Expression (this choice is only available for Dimensions with a Concept that represents a Hierarchic codelist).
The Mathematical Expression can be defined as either a Code evaluated against an expression (e.g., EUROPE=FR+DE) or a value evaluated against an expression (e.g., 55 >= GE). The expression area may be typed into manually and the user will need to do this in order to add mathematical symbols and numbers as appropriate. However the Quick Code Lookup, situated at the bottom of this page, lists all of the codes for this Dimension. Clicking an item in this list will insert it into the expression area.
For Hierarchic Rules, these can only be created when the Codelist referenced by the Dimension is also referenced by a Hierarchy – this permits the Rule Type option of Aggregate Using Hierarchy to be selected. If no hierarchies exist which use the same Codelist as the Dimension, the Rule Type option will be fixed on Custom Rule.
When the rule type is set to Aggregate using Hierarchy the result and expression section of the dialog is replaced with a list of available Hierarchies that can be selected (under the heading “Hierarchy to Use”). A Hierarchy must be selected, and the view of the selected hierarchy is shown to the right of the hierarchical list.
Once the rule has been defined, clicking “Add Rule” will dismiss the dialog and add this rule to the selected dimension.
Note that during creation, rules are not validated for logic or contradictory values. So, it is possible to create nonsensical rules such as EUROPE >= EUROPE. It is down to the user to ensure that the Validation Rules have a logical sense.