Difference between revisions of "Structure Map"

From FMR Knowledge Base

Revision as of 10:22, 28 January 2021

Overview

Simple example of a structure map

SDMX Version 3.0 Structure Mapping provides the ability to define a relationship between datasets conforming to a source DSD and datasets conforming to a target DSD. This relationship allows for the automatic conversion of data from one structure to another. For example, a source dataset may contain 8 Dimensions and use certain coding schemes, which may map to a dataset with only 5 Dimensions using different coding schemes.

Structure Mapping does not create new data. It should not be thought of as a mechanism to aggregate data, only to recode it and to change the shape of the data by adding, removing, or renaming Dimensions and Attributes.

Two Maintainable Structures are required to define how two datasets relate to each other. One is the Structure Map, which defines which source Components map to which target Components. The other is the Representation Map, which describes how values reported against the source Components map to values in the target Components.

A simple example is a relationship between the source COUNTRY component and the target REF_AREA component. The Structure Map has a source COUNTRY and a target REF_AREA. The rules that define how the values are mapped are maintained in the corresponding Representation Map; example rules would be GB maps to GBR, US maps to USA, and UY maps to URY.
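The COUNTRY to REF_AREA example can be sketched in a few lines of Python. This is only an illustration of the idea, not how Fusion Registry stores or applies mappings: the dictionary and the function name map_country are hypothetical, and a missing rule simply raises a KeyError here.

```python
# Hypothetical in-memory stand-in for a Representation Map (COUNTRY -> REF_AREA).
COUNTRY_TO_REF_AREA = {"GB": "GBR", "US": "USA", "UY": "URY"}


def map_country(row: dict) -> dict:
    """Return a target row in which COUNTRY is replaced by a recoded REF_AREA."""
    # Copy every component except the source component being mapped.
    out = {name: value for name, value in row.items() if name != "COUNTRY"}
    # Recode the source value using the rules (assumption: unmapped values are an error).
    out["REF_AREA"] = COUNTRY_TO_REF_AREA[row["COUNTRY"]]
    return out


print(map_country({"COUNTRY": "GB", "OBS_VALUE": "1.5"}))
```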

Structure Map

Source and Target DSD / Dataflow

The Structure Map defines the source and target DSD or, if the mapping is Dataflow specific, the source and target Dataflow. The source DSD/Dataflow is the one from which data is input into the mapping. Simple mapping rules are bi-directional (data mapped one way can be mapped back again), but for more complex mappings, which include regular expression matches or substring matches on the source, it is not always possible to map back again. The source DSD or Dataflow should therefore be selected based on where the data is coming from, and the target based on where the data is going as a result of the mapping.

Component Maps

The Structure Map defines 1 or more Component Maps. Each Component Map has 1 or more Components from the source DSD, mapping to 1 or more Components in the target DSD.

1 to 1 Mapping

The simplest relationship is a 1:1 mapping, for example REF_AREA maps to COUNTRY.

n to 1 Mapping

More complex relationships allow the combination of reported values across multiple source Components to influence what is output in the target, for example REF_AREA in combination with CURRENCY maps to CURRENCY_DENOM.

1 to n Mapping

Component Maps may define a single source mapping to multiple targets, for example REF_AREA maps to CURRENCY and CURRENCY_DENOM.

n to n Mapping

Finally multiple sources may map to multiple targets. A full description of how mapping relationships are used to solve use cases is provided in the mapping relationships section.

Defining how values map

Finding the mapped output using the intersection of each output set

Each Component Map can link to a Representation Map, which is used to describe how the source values map to the target values. The linked Representation Map links to source and target Codelists, Valuelists, or Free text. Like the Component Map, the Representation Map may contain multiple sources and multiple targets. The number and order of sources and targets to a Representation Map must match exactly that of the Component Map. For example if a Component Map has 2 sources REF_AREA and CURRENCY, then the linked Representation map must also have 2 sources, one for the REF_AREA Codelist and the other for the CURRENCY Codelist.

Representation Maps can include complex rules, such as regular expressions on source values, and can define periods of time for which a mapping relationship is true. For example, if a relationship between source country and target currency is defined, France could be mapped to the French Franc up until 2002 and to the Euro from 2002 onwards.
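A time-bounded rule such as the France example might be sketched as follows. The rule tuple layout, the function name map_currency, the codes FRF and EUR, and the exact boundary of 1 January 2002 are all assumptions made for illustration; the Registry's own representation of validity periods may differ.

```python
from datetime import date

# Hypothetical rules: (source code, target code, valid_from inclusive, valid_to exclusive).
# None means the rule is open-ended on that side.
RULES = [
    ("FR", "FRF", None, date(2002, 1, 1)),
    ("FR", "EUR", date(2002, 1, 1), None),
]


def map_currency(country: str, period_start: date):
    """Return the target code whose validity period covers period_start, or None."""
    for source, target, valid_from, valid_to in RULES:
        if source != country:
            continue
        if valid_from is not None and period_start < valid_from:
            continue
        if valid_to is not None and period_start >= valid_to:
            continue
        return target
    return None


print(map_currency("FR", date(1999, 6, 1)))
print(map_currency("FR", date(2005, 1, 1)))
```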

If values do not require mapping, for example if source FREQ maps to target FREQ and the values are the same in both the source and target DSD, then the Component Map should not link to a Representation Map. The lack of a link informs the system that the value should be copied across verbatim.

Time Mapping

A Structure Map may also define relationships between a source Component and target Component where the value in the source is a representation of time, which needs to be converted to conform to SDMX Time Formatting.

The output of a time mapping is always a date in SDMX format. The date must be formatted for a specific Frequency, which is either defined as a fixed value or linked to the value in another Dimension, for example output Frequency = the value reported in the FREQ Dimension (in the target DSD).

Time mapping is split into two separate types, Epoch Mapping and Time Pattern Mapping.

Epoch Mapping

This is used if the source Component represents time as a number, where the number is a count of epochs since a base period and each epoch is a fixed interval of time. An example is UNIX time expressed as the number of milliseconds since 1970. When defining this mapping, the source Component requires the base period (e.g. 1970), the epoch interval (e.g. milliseconds), and the output Frequency (a fixed value, or based on the value of another Dimension).
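The arithmetic behind an Epoch Mapping can be sketched as below: base period plus (number of epochs times epoch interval), then formatted for the output Frequency. The function name epoch_to_sdmx, the UTC assumption, and the three frequency codes handled (A, M, D) are illustrative choices, not the Registry's implementation.

```python
from datetime import datetime, timedelta, timezone


def epoch_to_sdmx(value: int, base_year: int = 1970,
                  interval: timedelta = timedelta(milliseconds=1),
                  freq: str = "M") -> str:
    """Convert a count of fixed-length epochs since a base period to a time period string."""
    # Assumption: the base period starts at 1 January of base_year, UTC.
    base = datetime(base_year, 1, 1, tzinfo=timezone.utc)
    moment = base + value * interval
    if freq == "A":
        return f"{moment.year}"
    if freq == "M":
        return f"{moment.year}-{moment.month:02d}"
    if freq == "D":
        return f"{moment.year}-{moment.month:02d}-{moment.day:02d}"
    raise ValueError(f"frequency not covered by this sketch: {freq}")


# 1611829320000 milliseconds after 1970 falls on 28 January 2021.
print(epoch_to_sdmx(1611829320000, freq="M"))
print(epoch_to_sdmx(1611829320000, freq="D"))
```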

Time Pattern Mapping

This is used if the source Component represents time as a string which conforms to a particular pattern, for example MM-dd-yyyy (month, day, year). Each pattern can be mapped to a specific frequency, or the general rule of outputting according to a Frequency Dimension can be used.

The following patterns are supported: [Java Simple Time Format]
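As a rough sketch of a Time Pattern Mapping where the output format follows a Frequency Dimension, the example below parses a month-day-year string and formats it according to the FREQ value in the same row. It uses Python's strptime directives rather than the Java pattern letters, and the component names DATE and TIME_PERIOD, the function name map_time, and the format table are assumptions for illustration.

```python
from datetime import datetime

# Hypothetical output formats per frequency code; only A, M and D are sketched.
OUTPUT_FORMATS = {"A": "%Y", "M": "%Y-%m", "D": "%Y-%m-%d"}


def map_time(row: dict, source: str = "DATE", target: str = "TIME_PERIOD",
             pattern: str = "%m-%d-%Y") -> dict:
    """Parse row[source] with the given pattern and emit it formatted for row['FREQ']."""
    parsed = datetime.strptime(row[source], pattern)
    # Copy every component except the source time component.
    out = {name: value for name, value in row.items() if name != source}
    out[target] = parsed.strftime(OUTPUT_FORMATS[row["FREQ"]])
    return out


print(map_time({"DATE": "01-28-2021", "FREQ": "M"}))
```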

Components with Fixed Values

Some Components on the output may have a fixed value, for example Frequency is always M regardless of the input data. This is defined at the level of the structure map. As mappings can be bi-directional the input can also have a fixed value, so when mapping the other way (from target to source) the input becomes the output.

Mapping Rules: Use Cases

Many Outputs from a Single Source

An example use case is a Dimension called UNIQUE_KEY whose values uniquely identify a series, for example SER1, SER2, SER3. The mapped DSD splits this into multiple Dimensions: FREQ, REF_AREA, and INDICATOR. The mapping rules split the unique key SER1 into FREQ:M, REF_AREA:UK, and INDICATOR:EMPLOYED. Another unique key maps to different values, for example SER2 maps to FREQ:M, REF_AREA:FR, and INDICATOR:EMPLOYED.

Solution 1

This type of use case can be solved by creating 3 Component Maps:

Component Map 1: Source=UNIQUE_KEY Target=FREQ
Component Map 2: Source=UNIQUE_KEY Target=REF_AREA
Component Map 3: Source=UNIQUE_KEY Target=INDICATOR

Each Component Map is backed by a Representation Map, which maps the value of the Unique Key to the output. Example

Representation Map = UNIQUE_KEY -> FREQ Values: SER1=M, SER2=M, SER3=A

Representation Map = UNIQUE_KEY -> REF_AREA Values: SER1=UK, SER2=FR, SER3=UK

Representation Map = UNIQUE_KEY -> INDICATOR Values: SER1=EMPLOYED, SER2=EMPLOYED, SER3=EMPLOYED

In this example, the unique Key SER1 would be mapped to M, UK, EMPLOYED.
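Solution 1 can be pictured as three independent lookups, one per Component Map, each producing one target Dimension. The dictionaries below copy the values listed above; the structure and the function name map_unique_key are hypothetical.

```python
# One hypothetical Representation Map per target Dimension, keyed by UNIQUE_KEY value.
REPRESENTATION_MAPS = {
    "FREQ": {"SER1": "M", "SER2": "M", "SER3": "A"},
    "REF_AREA": {"SER1": "UK", "SER2": "FR", "SER3": "UK"},
    "INDICATOR": {"SER1": "EMPLOYED", "SER2": "EMPLOYED", "SER3": "EMPLOYED"},
}


def map_unique_key(key: str) -> dict:
    """Apply each Component Map in turn and collect the target values."""
    return {target: rules[key] for target, rules in REPRESENTATION_MAPS.items()}


print(map_unique_key("SER1"))
```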

Solution 2

An alternative solution to this mapping is to create a single Component Map which maps the source UNIQUE_KEY to three outputs FREQ, REF_AREA, INDICATOR.

A single Representation Map is required to map each UNIQUE_KEY to the three outputs.

Representation Map = UNIQUE_KEY -> FREQ:REF_AREA:INDICATOR Values: SER1=M:UK:EMPLOYED, SER2=M:FR:EMPLOYED
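Solution 2 can be sketched as a single lookup that yields all three targets at once. The colon-separated values follow the notation used in this section; treating them as one string split on ":" is an assumption made purely for illustration, as are the names TARGETS, COMBINED_MAP and map_unique_key_combined.

```python
# Order of target Dimensions, matching the order of the Component Map targets.
TARGETS = ("FREQ", "REF_AREA", "INDICATOR")

# Hypothetical single Representation Map: one rule per UNIQUE_KEY value.
COMBINED_MAP = {"SER1": "M:UK:EMPLOYED", "SER2": "M:FR:EMPLOYED"}


def map_unique_key_combined(key: str) -> dict:
    """Look up one rule and pair its values with the targets by position."""
    values = COMBINED_MAP[key].split(":")
    return dict(zip(TARGETS, values))


print(map_unique_key_combined("SER2"))
```

Pairing values with targets by position is why the number and order of targets in the Representation Map must match those of the Component Map.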

Summary

The choice between splitting the mapping into separate Component Maps and using a single rule should be based on which is more maintainable and understandable, and on whether individual mapping rules can be reused by other Structure Maps.

Many Sources map to a Single output

This is the reverse of the above use case and has the same 2 solutions: split the rules into individual maps, or describe the relationship in a single map. If the rule is split into individual maps, it is important to note how Fusion Registry determines the final output.

If the output from Component Map 1 is a set of possible values, FR and DE, and the output from Component Map 2 is another set of possible values, DE and UK, then the intersection of both sets is used to find the final output, in this case DE.
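The intersection step can be sketched with Python sets. The function name resolve_output is hypothetical, and raising an error when the intersection does not contain exactly one value is an assumption; this section does not say how Fusion Registry handles that case.

```python
def resolve_output(candidate_sets):
    """Intersect the candidate outputs of each Component Map and return the single survivor."""
    common = set.intersection(*(set(candidates) for candidates in candidate_sets))
    if len(common) != 1:
        # Assumption: an empty or ambiguous intersection is treated as an error in this sketch.
        raise ValueError(f"expected exactly one common value, got {sorted(common)}")
    return next(iter(common))


# Component Map 1 yields {FR, DE}; Component Map 2 yields {DE, UK}.
print(resolve_output([{"FR", "DE"}, {"DE", "UK"}]))
```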


If-Then-Else (default if not specified)

The order of rules in a Representation Map can be important, specifically when using regular expressions. The regular expressions are tested in the order in which they are defined, which allows more specific expressions to be tested before a general catch-all.

Example
Rule 1 (no regex): A -> B (A maps to B)
Rule 2 (reg ex): A\dB -> B2 (A followed by a number followed by B maps to B2)
Rule 3 (reg ex): .* -> _Z (anything maps to _Z)

The input data is first checked against the exact match in Rule 1 (A maps to B), then against the more specific regular expression in Rule 2 (A1B would map to B2). If neither matches, the catch-all Rule 3 applies and the output is _Z.
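The three rules can be sketched as an ordered list that is walked until the first match. The tuple layout and the function name apply_rules are hypothetical, and matching the regular expression against the whole input value (rather than a substring) is an assumption of this sketch.

```python
import re

# Rules in definition order: (pattern, output, is_regex).
RULES = [
    ("A", "B", False),       # Rule 1: exact match
    (r"A\dB", "B2", True),   # Rule 2: A, one digit, B
    (r".*", "_Z", True),     # Rule 3: catch-all
]


def apply_rules(value: str):
    """Return the output of the first rule that matches, or None if no rule matches."""
    for pattern, output, is_regex in RULES:
        if is_regex:
            # Assumption: the expression must match the entire value.
            if re.fullmatch(pattern, value):
                return output
        elif value == pattern:
            return output
    return None


for value in ("A", "A1B", "XYZ"):
    print(value, "->", apply_rules(value))
```

If Rule 3 were listed first, every input would map to _Z, which is why the catch-all belongs last.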

Pattern match input, use matched text on output

A rule can be defined to match a specific pattern, part of which is then used in the output. For example, a rule can state that any three characters followed by a number are converted to the same three characters without the number. This can be achieved by using a Regular Expression with capture groups to match the input. A capture group is a part of the regular expression enclosed in parentheses, which can then be referred to by number (capture group 1, 2, 3, and so on).

Example:
RegEx Input: ([A-Z]{3})_([0-9])
Output Expression: \2_\1

This example consists of 2 capture groups:

  1. ([A-Z]{3}) Any A to Z character 3 times
  2. ([0-9]) A single digit from zero to nine


The output expression then reverses the order of the information by outputting capture group 2, an underscore, followed by capture group 1.

An example input for the above expression, and corresponding output is as follows:
Input = ABC_1
Output = 1_ABC
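The same input and output expressions can be tried with Python's re module, whose backreference syntax (\1, \2) happens to match the notation used here. The function name rewrite is hypothetical, and requiring the expression to match the whole input is an assumption of this sketch.

```python
import re


def rewrite(value: str,
            input_regex: str = r"([A-Z]{3})_([0-9])",
            output_expression: str = r"\2_\1"):
    """Match the input against the regex and build the output from its capture groups."""
    match = re.fullmatch(input_regex, value)
    if match is None:
        return None  # the rule does not apply to this value
    # expand() substitutes \1, \2, ... with the text captured by each group.
    return match.expand(output_expression)


print(rewrite("ABC_1"))
```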

Pattern match input to be used on second input

This use case is where there is more than one source Component for a mapping, for example CURRENCY and REF_AREA. The value for one of the source Components is based on a pattern, and the value for the second Component is based on what matched the first pattern.

Like the pattern match on output, this rule makes use of regular expression capture groups to copy matched information from one rule to another. The capture group (everything matched inside the parentheses) is referred to by number, with a leading backslash (\).

For example:
CURRENCY = (.*)
REF_AREA = \1_X

In this example the first input matches on anything, but the REF_AREA rule is using the matched value from CURRENCY, defined by capture group \1, followed by _X. The following shows a match and a miss:

Match: CURRENCY=USD, REF_AREA=USD_X
Miss: CURRENCY=USD, REF_AREA=US_X
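The match and miss can be reproduced with a short sketch in which the text captured from CURRENCY is substituted into the REF_AREA rule before comparing. The function name rule_matches is hypothetical, and treating the expanded REF_AREA rule as a literal value to compare against is an assumption of this sketch.

```python
import re


def rule_matches(currency: str, ref_area: str,
                 currency_rule: str = r"(.*)",
                 ref_area_rule: str = r"\1_X") -> bool:
    """Return True if ref_area equals the REF_AREA rule after inserting CURRENCY's capture groups."""
    match = re.fullmatch(currency_rule, currency)
    if match is None:
        return False
    # Replace \1 in the second rule with what the first rule captured.
    expected_ref_area = match.expand(ref_area_rule)
    return ref_area == expected_ref_area


print(rule_matches("USD", "USD_X"))
print(rule_matches("USD", "US_X"))
```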

Mapping UNIX Time

If the source Component has a timestamp in UNIX Time, an Epoch Map should be created which maps from the source Component to target Component, with the base period set to 1970, and the epoch set to milliseconds.

Testing Structure Map

Structure Mapping can be a complex task. If multiple sources influence many targets, and rules include regular expressions and substring matches, it is important to test the mapping to ensure the outputs are as expected.

The Fusion Registry provides a testing feature. To use it, first log in to the Fusion Registry, navigate to the Structure Map page, select the Structure Map to test, and click on the Test Mapping button. A data file must be loaded which conforms to either the source or the target DSD. The data file may be in SDMX format or CSV. If loading CSV, ensure the column headers match the Component Ids.

Mapping reports are generated at the level of each row of information. If loading time series data, a row should be thought of as a single observation along with all the series information. The report describes the output row for each input row.

The Fusion Registry can also produce finer-grained reports detailing exactly how a specific row was mapped; in the User Interface this is achieved by clicking on the row. The mapping report is broken down by each mapping rule in the Structure Map, showing what the input to the rule was and what the output was.

Conversion of Data

The Fusion Registry Transformation services can use a Structure Map to convert data.

Asynchronous Transformation
Synchronous Transformation