
Data

Open Data Source

This menu item allows opening the file or the database selector and then starts the Data Import Wizard.

Text file: Once the file is read and the pre-processing is done, a fully unconnected network is created in a new graph window, with one node per attribute. The Bayesian network learning methods then become available.
Database: Once the database table is loaded and the pre-processing is done, a fully unconnected network is created in a new graph window, with one node per attribute. The Bayesian network learning methods then become available.
Recent databases: Keeps a list of the recently opened databases. The Data Import Wizard opens directly on the selected file. The size of this list can be modified through the settings menus.

Associate Data Source

This menu item allows opening the Data association wizard in order to associate data from a text file or a database with an existing Bayesian network.

Recent databases: Keeps a list of the recently opened databases. The Data association wizard opens directly on the selected file. The size of this list can be modified through the settings menus.

When the network structure is modified during the association (addition of nodes or states), the conditional probability tables are automatically recomputed from the database. If the structure remains unmodified, the conditional probability tables are not modified.

Associate Dictionary

This menu item allows defining the properties of the active Bayesian network by means of text files. These properties concern arcs, nodes, and states:

Arc:
- Arcs: allows associating a set of arcs with the network. The indicated arcs can be added to or removed from the network. Arc removal is always performed before arc addition. Before an arc is added, all the constraints belonging to the Bayesian network, as well as the arc constraints and the temporal indices, are checked. If a constraint is not satisfied, the arc is not added.
- Forbidden Arcs: allows associating a set of forbidden arcs with the network.
- Arc Comments: allows associating a set of arc comments with the network.
- Arc Colors: allows associating a set of arc colors with the network.
- Fixed Arcs: allows defining if some arcs are fixed or not.
Node:
- Node Renaming: allows renaming each node. The new names must all be different.
- Comments: allows associating a comment with each node that is in the file.
- Classes: allows organizing nodes in subsets called classes. A node can belong to several classes at the same time. Classes make it possible to apply certain node properties to all the nodes belonging to the same class. They also allow creating constraints on arc creation during learning.
- Colors: allows associating colors with the nodes or classes that are in the file. Colors are written as red, green, and blue components with 8 bits per channel in hexadecimal (web) format. For example, red is 255 red, 0 green, 0 blue, which gives FF0000. Green gives 00FF00, yellow gives FFFF00, etc.
- Images: allows associating images with the nodes or classes that are in the file. Each image is specified by its path relative to the directory that contains the dictionary.
- Costs: allows associating a cost with each node. A node without a cost is called not observable.
- Temporal Indices: allows associating temporal indices with the nodes that are in the file. BayesiaLab's learning algorithms use these temporal indices to enforce constraints on the probabilistic relations, for example, to prevent adding arcs from future nodes to past nodes. The rule used to add an arc from node N1 to node N2 is:
- If the temporal index of N1 is positive or zero, then the arc from N1 to N2 is only possible if the temporal index of N2 is greater than or equal to the index of N1.
- Local Structural Coefficients: allows setting the local structural coefficient of each specified node or each node of each specified class.
- State Virtual Numbers: allows setting the state virtual number of each specified node or each node of each specified class.
- Locations: allows setting the position of each node.
State:
- State Renaming: allows renaming each state of each node with a new name.
- State Values: allows associating a numerical value with each state of each node.
- State Long Names: allows associating with each state of each node a long name that is more explicit than the default state name. This name can be used when exporting a database, in the HTML reports, and in the monitors.
- Filtered States: allows defining one state of each node as a filtered state.
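
The temporal index rule given under Temporal Indices can be sketched in a few lines of Python. This is only an illustration: the function name `arc_allowed` is hypothetical and not part of BayesiaLab, and the documented rule only covers the case where the index of N1 is positive or zero. Treating a negative index as imposing no temporal constraint is an assumption made here.

```python
def arc_allowed(index_n1: int, index_n2: int) -> bool:
    """Return True if an arc N1 -> N2 respects the temporal index rule.

    Assumption: a negative index on N1 imposes no temporal constraint;
    the documented rule only covers the non-negative case.
    """
    if index_n1 >= 0:
        # Documented rule: N2 must not lie earlier in time than N1.
        return index_n2 >= index_n1
    return True  # assumption, not stated in the documentation


print(arc_allowed(0, 1))  # arc from an earlier node to a later node
print(arc_allowed(2, 1))  # arc from a later node to an earlier node
```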

As indicated by the syntax, the name of a node, class, or state in the text file cannot contain equal signs, spaces, or tab characters. If the names in the network contain such characters, each of them must be preceded by a \ (backslash) character in the text file. For example, the node named Visit Asia is written Visit\ Asia in the file.
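
If dictionary files are generated by a script, the escaping can be automated. The following Python sketch uses a hypothetical helper, `escape_name`, that places a backslash before every equal sign, space, and tab. How a literal backslash inside a name should be written is not covered by the text above, so the sketch leaves backslashes untouched.

```python
import re


def escape_name(name: str) -> str:
    """Prefix every equal sign, space, and tab with a backslash."""
    return re.sub(r"([= \t])", r"\\\1", name)


print(escape_name("Visit Asia"))  # prints Visit\ Asia
```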

To distinguish a name that is shared by a class, a node, or a state, you must append to the name the suffix "c" for a class, "n" for a node, or "s" for a state.

If your network contains non-ASCII characters, you must save your dictionaries with UTF-8 (Unicode) encoding. For example, in MS Excel, choose "Save As" and select "Text Unicode (*.txt)" as the file type. In Notepad, choose "Save As" and select "UTF-8" as the encoding. If your file contains only ASCII characters, you can keep the default encoding (which depends on the platform), but using UTF-8 (Unicode) encoding is strongly encouraged so that dictionary files do not depend on the user's platform. For example, a Chinese dictionary can then be read by a German user without any problem, whatever platforms are used. If you are not sure how to save a file with UTF-8 encoding, export a dictionary with BayesiaLab, modify and save it (with any text editor), and load it in BayesiaLab.

Export Dictionary

This menu item allows exporting the different kinds of dictionaries in text files.

The dictionary files are saved with UTF-8 (Unicode) encoding in order to support any character of any language. An option in the Import and Associate preferences, Save Format, allows saving or omitting the BOM (Byte Order Mark) at the beginning of the file. The BOM increases compatibility with Microsoft applications. On other platforms, such as Unix, Linux, or Mac OS X, the BOM is not necessary and, in some cases, is treated as extra characters at the beginning of the file.
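
For readers who post-process exported dictionaries with their own scripts, the following Python sketch shows what the BOM option means at the byte level. The helper names and the sample text are illustrative assumptions. Only the codec names "utf-8" and "utf-8-sig" come from the Python standard library.

```python
import os
import tempfile


def write_text(path: str, text: str, with_bom: bool) -> None:
    """Write UTF-8 text, optionally preceded by the byte order mark."""
    # "utf-8-sig" is Python's codec name for UTF-8 with a leading BOM.
    encoding = "utf-8-sig" if with_bom else "utf-8"
    with open(path, "w", encoding=encoding, newline="") as handle:
        handle.write(text)


def starts_with_bom(path: str) -> bool:
    """Check whether a file begins with the three UTF-8 BOM bytes."""
    with open(path, "rb") as handle:
        return handle.read(3) == b"\xef\xbb\xbf"


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "dictionary.txt")
    write_text(path, "caf\u00e9\n", with_bom=True)
    print(starts_with_bom(path))
    # Reading with "utf-8-sig" drops the BOM if one is present.
    with open(path, encoding="utf-8-sig") as handle:
        print(handle.read())
```

Reading with "utf-8-sig" works for files with or without a BOM, which avoids the stray leading characters mentioned above.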

Associate an Evidence Scenario File

This menu item allows associating an evidence scenario file with the network.

Export an Evidence Scenario File

This menu item allows exporting into a text file an evidence scenario file associated with the network.

Save Data

This menu item allows saving the database associated with the network, including the results of the various pre-processing steps carried out within the Data Import Wizard (discretization, aggregation, filtering, etc.). If the imported database still contains missing values and the algorithm selected to process them is one of the two imputation algorithms (static or dynamic), then this option lets you carry out all your imputation tasks by saving a database without any missing values. Each missing value is replaced by taking into account its conditional probability distribution, returned by the Bayesian network, given all the known values of the row. If the database contains both test data and learning data, you can choose which data to save: only learning data, only test data, or all data. It is also possible to save only the data corresponding to the selected nodes.

The states' long names can be saved instead of the states' names. The numerical values in the database associated with the continuous nodes can be saved if they exist. If no numerical values are associated with the database and the option is checked, numerical values are created by randomly generating a value in each relevant interval. If the database contains weights, they are saved as the first column in the output file.
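
The random generation of a numerical value inside an interval can be pictured with the short Python sketch below. The text does not say which distribution BayesiaLab draws from, so the uniform draw is an assumption, and the interval bounds and state names are made up for illustration.

```python
import random


def value_in_interval(lower: float, upper: float, rng: random.Random) -> float:
    """Draw one value between lower and upper, uniformly (an assumption)."""
    return rng.uniform(lower, upper)


rng = random.Random(42)  # seeded only to make the sketch repeatable
intervals = {"low": (0.0, 10.0), "high": (10.0, 25.0)}  # made-up thresholds
row_states = ["low", "high", "low"]
values = [value_in_interval(*intervals[state], rng) for state in row_states]
print(values)
```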

Imputation

Allows the imputation of the missing values of the associated database according to the mode selected in the following dialog box:

The data will be saved in the specified file, and the long names of the states will be used if specified. If the database contains both test data and learning data, you can choose on which data to perform imputation: only learning data, only test data, or all data. The states' long names can be saved instead of the states' names. The numerical values in the database associated with the continuous nodes can be saved if they exist. If no numerical values are associated with the database and the option is checked, numerical values are created by randomly generating a value in each relevant interval. However, if there are numerical values in the database, the missing numerical values are generated from the distribution function of each interval. If the database contains weights, they are saved as the first column in the output file.

Graphs

Opens the graph editor if a database is associated with the current network.

Open Data Source (Data Import Wizard)

Context

The Data Import Wizard is the principal tool in BayesiaLab for preprocessing and importing external data.

Data Sources

You can use BayesiaLab's Data Import Wizard to import data from two types of sources:

Data tables in text format, in which data fields are separated by delimiters, such as comma, semicolon, tab, or pipe "|". The most common format is CSV.
Data tables in SQL-compatible databases can be accessed via a JDBC driver. Third-party JDBC drivers are available for all major databases.

All data sources must be structured as a single table, i.e., with rows and columns. All table joins must be performed before importing the data in BayesiaLab.

Usage

To launch the Data Import Wizard for a data table in a
- text file, select Main Menu > Data > Open Data Source > Text File.
- database, select Main Menu > Data > Open Data Source > Database.

Then, the Data Import Wizard guides you through five sequential steps. The first step of the Data Import Wizard depends on the data source, i.e., text file or database. All subsequent steps of the Data Import Wizard are the same for both types of data sources.

Data Structure Definition
- Data table in a text format
- Data table in a database
Definition of Variable Types
Data Selection, Filtering, and Missing Value Processing
Discretization and Aggregation
Import Report

Step 1 — Data Structure Definition: Text File

Context

Open Data Source (Data Import Wizard) brings data into BayesiaLab to create a new Bayesian network.

BayesiaLab can load data from flat text files (e.g., CSV, TXT) or connected databases.

Usage

In Step 1 — Data Structure Definition: Text File of the five-step Data Import Wizard, you need to define the dataset structure for BayesiaLab so that the data can be imported and interpreted correctly.
The Data Structure Definition window opens up.
Specify all Settings & Options (see below).
Click Next to proceed to Step 2 — Definition of Variable Types.

Many of the settings can be immediately reviewed and validated in the Data Preview panel. However, Missing Values or Filtered Values can be mischaracterized and yet go unnoticed and, later, introduce major problems causing misleading analysis results.

Separators

The Data Import Wizard will attempt to automatically identify the separator or delimiter of the fields in the data table.

However, there can be ambiguous situations in which you need to specify the separator by checking the appropriate box:

Tab
Semicolon
Comma
Space
Other

If you prepare a dataset externally for import into BayesiaLab, ensure that separators are unique and do not appear as content in any data field. So, if any data fields contain text with commas as content, you cannot use commas as the separator. In such a case, try a tab or semicolon.
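
When preparing a dataset externally, a quick check can tell you which separators are safe to use. The following Python sketch is an illustration only. The function name `safe_separators` and the sample rows are hypothetical, and the check runs on your own data before export, not inside BayesiaLab.

```python
def safe_separators(rows, candidates=(",", ";", "\t", "|")):
    """Return the candidate separators that occur in no data field."""
    return [
        sep for sep in candidates
        if not any(sep in field for row in rows for field in row)
    ]


# Made-up rows: the name field contains commas, so a comma is unsafe.
rows = [["Smith, John", "42"], ["Doe, Jane", "37"]]
print(safe_separators(rows))  # prints [';', '\t', '|']
```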

Encoding

The Encoding drop-down list allows you to select an alternative encoding for the dataset to be imported. This can become necessary for importing data from certain legacy systems.

Missing Values

Specifying the correct code for Missing Values is very important so that BayesiaLab can process such Missing Values appropriately.

The list shows a number of codes that are commonly used for Missing Values. However, this is not necessarily comprehensive, and your dataset may contain different codes, such as "." (dot) or "-9999", etc.

Click Add to create a new entry in this list for the current data import.
Clicking Remove deletes the selected entries.

Deleting a default entry such as NR (for no response) may become necessary, for instance, if a data field contains the string "NR" as a valid value. That would be the case if your data set included New York Stock Exchange ticker symbols. In this context, "NR" would be the symbol of Newpark Resources, Inc. Unless you address this issue, all "NR" strings would be treated as Missing Values.

You can set your own default list of codes under Main Menu > Windows > Preferences > Data > Import & Associate > Missing & Filtered Values.

Filtered Values

Just as important as the correct definition of Missing Values is a clear understanding of a Filtered Value.

A Filtered Value occurs when a variable cannot have any value for logical reasons. For instance, in a demographic dataset, there could be a field Age at Retirement. However, in the record of a 16-year-old high school student in this dataset, there could be no value for the field Age at Retirement. However, this situation must not be treated as a Missing Value! A Missing Value implies that a value exists but is unknown. In the case of the student's record, a value is logically impossible, not missing. So, instead of a numerical value or a blank, you must specify a code that says that there can be no value. This is the purpose of assigning a Filtered Value code.

Importantly, you must encode any Filtered Values before importing your dataset into BayesiaLab. In BayesiaLab, you merely need to declare what code you used in your dataset to represent Filtered Values. BayesiaLab will create a Filtered State as an additional state in each node for which Filtered Values are encountered during data import.

Click Add to create a new entry in this list for the current data import.
Clicking Remove deletes the selected entries.

You can set your own default list of codes under Main Menu > Windows > Preferences > Data > Import & Associate > Missing & Filtered Values.

In Data Preview, all Filtered Values are marked with an asterisk (*) in the data table.

Understanding the difference between Missing and Filtered Values is critically important.
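
To make the distinction concrete, the following Python sketch labels raw cells before import. The code lists are examples, not BayesiaLab defaults, and `classify_cell` is a hypothetical helper. In practice, the codes are whatever was agreed upon when the dataset was prepared.

```python
MISSING_CODES = {"", "NR", "-9999"}   # example codes only
FILTERED_CODES = {"Not Applicable"}   # example code only


def classify_cell(raw: str) -> str:
    """Label a raw cell as 'missing', 'filtered', or 'value'."""
    if raw in FILTERED_CODES:
        return "filtered"   # a value is logically impossible
    if raw in MISSING_CODES:
        return "missing"    # a value exists but is unknown
    return "value"


student = {"Age": "16", "Age at Retirement": "Not Applicable"}
retiree = {"Age": "70", "Age at Retirement": ""}
print(classify_cell(student["Age at Retirement"]))  # prints filtered
print(classify_cell(retiree["Age at Retirement"]))  # prints missing
```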

Sampling

Clicking the Define Sample button opens a window that allows you to sample records from your data source.

This is particularly useful for the preliminary analysis of large datasets. By default, BayesiaLab imports all records from the data.

You can define a subset in three ways:

Random Sample — Size in Percent: specify the size of the random sample as a percentage of the original dataset size.
Random Sample — Size: specify the number of records in the sampled dataset.
Custom Range — First Row to Last Row: specify the range of records to be imported.

Checking the option Fixed Seed and specifying a number ensures that you can repeat exactly the same random sampling for each iteration of the import. This allows you to reproduce your results as you develop your model.

Learning/Test

By default, the Data Import Wizard loads the entire dataset as a Learning Set.

By clicking the Define Learning/Test Sets button, you can set aside a Test Set (or holdout sample).

You can define the Learning Set/Test Set split in three ways:

Random Test Set — Size in Percent: specify the size of the Test Set as a percentage of the original dataset size.
Random Test Set — Size: specify the number of records in the Test Set.
Custom Test Set — First Row to Last Row: select a specific range of records for a Test Set.

Checking the option Fixed Seed and specifying a number ensures that you can obtain the same Test Set with each iteration of the import. This allows you to reproduce your results and validation measures as you develop your model.
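
The effect of a fixed seed can be illustrated outside BayesiaLab with a small Python sketch. The function `split_rows` is hypothetical, and Python's random generator is not the one BayesiaLab uses, so the same seed will not reproduce BayesiaLab's split. The sketch only shows why a fixed seed makes a random split repeatable.

```python
import random


def split_rows(n_rows: int, test_percent: float, seed: int):
    """Return (learning_indices, test_indices) for a random split."""
    rng = random.Random(seed)  # fixed seed: same split on every run
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_test = round(n_rows * test_percent / 100)
    test = sorted(indices[:n_test])
    learning = sorted(indices[n_test:])
    return learning, test


learning, test = split_rows(n_rows=10, test_percent=20, seed=31)
print(learning, test)
```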

In addition to specifying a Learning Set/Test Set split here, you can define a split in other ways:

You can designate a variable in the original dataset to assign records to the Learning Set and Test Set. You can select such a variable in the next step of the Data Import Wizard: Step 2 — Definition of Variable Types.
Main Menu > Data > Data Set > Generate Learning/Test Split

Furthermore, you can remove the Learning Set/Test Set split at any time:

Main Menu > Data > Data Set > Remove Learning/Test Split.

Options

The Options Panel allows you to manage the interpretation of the to-be-imported dataset.

Title Line:
- By checking this option, BayesiaLab reads the first row of the dataset and uses its values as column headers.
- If the values in the first row are not compatible, e.g., due to missing values or duplicate values, you are prompted to accept the proposed corrections, which include adding suffixes for duplicate names and substituting missing values with generic column headers, e.g., N0, N1, N2, etc.
End of Line Character:
- With some files, it may be necessary to specify a certain character so that BayesiaLab can correctly detect the end of a row in a data table.
Consider Identical Consecutive Separators as One:
- Check this box so that if you have multiple consecutive separators of the same type, e.g., “;;;”, the Data Import Wizard will treat them as a single separator.
Consider Different Consecutive Separators as One:
- Check this box so that if you have multiple consecutive separators of any type, e.g., “;,|”, the Data Import Wizard will treat them as a single separator.
Double Quotes:
- Remove
- As String Delimiters
Simple Quotes:
- Remove
- As String Delimiters
Transpose:
- By default, BayesiaLab expects the data source to be arranged in
  - columns corresponding to variables and
  - rows corresponding to samples, records, or observations.
- Checking the Transpose option allows you to accept an alternate format, i.e.,
  - rows corresponding to variables and
  - columns corresponding to samples, records, or observations.
- The transposed format is commonly used in bioinformatics. For instance, variables representing genes — sometimes tens of thousands — are arranged row by row. Observations — sometimes only a few dozen — are placed in columns side by side.
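
The difference between the two consecutive-separator options above can be shown with a short Python sketch. The function `split_line` and its parameter names are hypothetical. The sketch mimics the described behavior with regular expressions and does not reflect how BayesiaLab is implemented.

```python
import re


def split_line(line, separators=";", collapse_identical=False,
               collapse_different=False):
    """Split a line on any of the separator characters."""
    escaped = [re.escape(char) for char in separators]
    if collapse_different:
        # Any run of separator characters counts as one separator.
        pattern = "[" + "".join(escaped) + "]+"
    elif collapse_identical:
        # Only a run of the same character counts as one separator.
        pattern = "|".join(char + "+" for char in escaped)
    else:
        pattern = "[" + "".join(escaped) + "]"
    return re.split(pattern, line)


print(split_line("a;;;b"))                                   # ['a', '', '', 'b']
print(split_line("a;;;b", collapse_identical=True))          # ['a', 'b']
print(split_line("a;,|b", ";,|", collapse_identical=True))   # ['a', '', '', 'b']
print(split_line("a;,|b", ";,|", collapse_different=True))   # ['a', 'b']
```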

The data table at the bottom of the window provides a preview of how the Data Import Wizard sees and interprets your dataset.

Blank fields indicate a Missing Value.
Asterisks (*) mark Filtered Values. In the dataset shown below, for instance, Filtered Values were assigned to all males and post-menopausal women for the variable Pregnancy Status. For those two groups and for obvious reasons, pregnancy is impossible.
Horizontal and vertical sliders allow you to scroll and view the entire dataset. Alternatively, you can move your mouse's scroll wheel up and down.
If a variable name exceeds the column width, you can click on the divider between column headers and drag it into the desired position. Alternatively, double-click the divider to auto-fit the column width to the variable name.

Workflow Animation

In the following animation, we show a dataset that requires numerous settings to be adjusted for proper import:

The dataset uses the pipe character ("|") as a delimiter.
All fields are enclosed in double quotes.
Multiple, arbitrary codes are used for Missing Values:
- "Refused"
- "unknown"
"Not Applicable" is the code for Filtered Value used in this dataset.

Note that there are no standardized codes for Missing Values and Filtered Values. They can be as arbitrary as in this example. Therefore, it is of utmost importance that whoever prepares the dataset conveys the precise meaning of these codes to the analyst who imports the data into BayesiaLab.
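
To check such a file before import, the same settings can be reproduced with Python's standard csv module. This is a sketch under stated assumptions: the sample rows and column names are made up, `read_table` is a hypothetical helper, and the blank and asterisk markers simply follow the Data Preview conventions described earlier.

```python
import csv
import io

MISSING_CODES = {"Refused", "unknown"}
FILTERED_CODES = {"Not Applicable"}


def read_table(text):
    """Parse pipe-delimited, double-quoted text into a list of dicts."""
    reader = csv.reader(io.StringIO(text), delimiter="|", quotechar='"')
    header = next(reader)
    records = []
    for row in reader:
        cells = []
        for cell in row:
            if cell in MISSING_CODES:
                cells.append("")    # blank, as in the Data Preview
            elif cell in FILTERED_CODES:
                cells.append("*")   # asterisk, as in the Data Preview
            else:
                cells.append(cell)
        records.append(dict(zip(header, cells)))
    return records


# Made-up sample rows; the column names are illustrative only.
sample = (
    '"Age"|"Income"|"Pregnancy Status"\n'
    '"34"|"Refused"|"Not Applicable"\n'
    '"29"|"unknown"|"No"\n'
)
for record in read_table(sample):
    print(record)
```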

Step 2 — Definition of Variable Types

Context

In Step 2 — Definition of Variable Types of the five-step Data Import Wizard, you need to define variable types.
Step 2 contains four panels that relate to each other in their content and available actions.

Overview of Elements in Step 2

Type

With the radio buttons in the Type panel, you can define the type of each variable.
Before you start making your determinations, BayesiaLab has already made some guesses regarding the appropriate variable type, i.e., Discrete versus Continuous.
Furthermore, some variables have limited options regarding the variable type because of their distributions:
- If a variable has the same value for all observations, it falls into the Unused variable type. Such a not-distributed variable cannot be imported at all into BayesiaLab.
- Variables that contain any text values cannot be declared Continuous variables.
- Variables with Missing Values cannot be of the type Weight, Row Identifier, or Learn/Test.
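
The three restrictions above can be summarized in a short Python sketch. It encodes only what this list states. BayesiaLab may apply further checks that are not described here, and the function names and the missing-value code are hypothetical.

```python
def is_number(text: str) -> bool:
    """Return True if the text parses as a number."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def allowed_types(values, missing_codes=("",)):
    """Return the variable types permitted by the three listed rules."""
    if len(set(values)) == 1:
        return {"Unused"}  # same value everywhere: not distributed
    observed = [v for v in values if v not in missing_codes]
    types = {"Discrete", "Unused"}
    if all(is_number(v) for v in observed):
        types.add("Continuous")  # no text values present
    if len(observed) == len(values):
        # No Missing Values, so these types remain available.
        types |= {"Weight", "Row Identifier", "Learning/Test"}
    return types


print(sorted(allowed_types(["1", "x", "3"])))
print(sorted(allowed_types(["1", "", "2"])))
```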

Usage

You can perform the selection of multiple variables with keystroke combinations commonly used in spreadsheet editing:
- Ctrl+Click: add a variable to the current selection.
- Shift+Click: add all variables between the currently selected and the clicked variable to the selection.
- Shift+End: select all variables from the currently selected variable to the rightmost variable in the table.
- Shift+Home: select all variables from the currently selected variable to the leftmost variable in the table.
The current selection is highlighted by showing the selected columns in a darker shade of their current color.

Discrete

The Discrete type considers each unique value of the variable a distinct state.
Any variable that contains text will be considered Discrete by default.
The maximum number of unique values that can be accommodated can be specified under Main Menu > Window > Preference > Editing > Node > Maximum Number of States.

Continuous

The Continuous type applies to numerical variables, which must be discretized in Step 4 — Discretization and Aggregation.
If a variable contains integer values above a certain threshold, the variable will be considered Continuous.
You can specify this threshold under Main Menu > Windows > Preferences > Data > Import & Associate > Threshold for Assuming Integers as Continuous. The default threshold value is 5.

Learn more about Discrete and Continuous nodes in the Node Editor topic.

Weight

Weighting is often applied to surveys to make a survey sample representative of the demographics of the underlying population.

If your dataset contains such a Weight variable, select it by clicking on the corresponding column.
Then, select the Weight button in the Type panel.
Later, in Step 4 — Discretization and Aggregation, you can specify whether or not to normalize the Weight variable.

Learning/Test

For a dataset that has already been split into a Learning Set and a Test Set, you can use such an existing definition to import your data into BayesiaLab.

Both the Learning Set and the Test Set need to be in the same data table, rather than in separate files.
A binary indicator variable needs to identify each set with a unique code.
With a Learning/Test variable defined, in Step 4 — Discretization and Aggregation of the Data Import Wizard, you need to assign which of your codes corresponds to BayesiaLab's Learning and Test states.

Row Identifier

You can assign one or more variables to serve as Row Identifiers. The values of Row Identifiers are imported but not processed in any way. They serve as labels that are attached to each record.

There are numerous functions in BayesiaLab that allow you to look up what record in the dataset corresponds to what is currently on display on the screen.
For instance, Automatic Evidence-Setting displays the Row Identifier in the Status Bar.

Unused

By selecting the Unused button, you can skip the import of the selected variables. In previous versions of BayesiaLab, this option was also known as "Not Distributed."

Unused is automatically applied to variables containing only a single value across all observations, i.e., when the variable is "not distributed," hence the original name.
Unused variables will appear grayed out in the remaining steps of the Data Import Wizard.

Multiple Typing

The Multiple Typing panel allows you to quickly assign variable types across multiple variables.

By clicking either button, all previous type assignments are replaced.

You can automatically remove variables, i.e., set them to the Unused type, if they exceed a certain column percentage of Missing Values.

Click the Set Missing Values Threshold button.
From the pop-up window, set the percentage.

All variables that exceed the specified threshold are set to Unused.
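
The threshold logic can be sketched in Python as follows. The function `columns_to_unused`, the table layout, and the sample data are illustrative assumptions. Consistent with the wording above, a variable is flagged only if its percentage of Missing Values strictly exceeds the threshold.

```python
def columns_to_unused(table, threshold_percent, missing_codes=("",)):
    """Return the columns whose share of Missing Values exceeds the threshold.

    table maps each column name to its list of raw cell values.
    """
    unused = []
    for name, cells in table.items():
        missing = sum(1 for cell in cells if cell in missing_codes)
        percent = 100.0 * missing / len(cells)
        if percent > threshold_percent:
            unused.append(name)
    return unused


# Made-up data: Age is 25% missing, Income is 75% missing.
table = {
    "Age": ["34", "", "29", "51"],
    "Income": ["", "", "", "40000"],
}
print(columns_to_unused(table, threshold_percent=50))  # prints ['Income']
```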

Information

The Information panel provides a range of statistics relating to the current type assignment of variables:

Number of Rows refers to the number of records in the to-be-imported datasets. In the context of datasets, rows, records, cases, samples, and observations all have equivalent meanings.
Others displays the count of all the variables assigned to the types Row Identifier, Weight, or Learn/Test.
Unused shows the absolute count of variables currently assigned to the Unused type. The percentage refers to the proportion of Unused variables among all variables.
Missing Values displays the count of cells in the dataset that contain Missing Values. The percentage refers to the proportion of cells in the dataset that contain Missing Values, including all variable types, even Unused, Row Identifier, and Learning/Test.
Filtered Values displays the count of cells in the dataset that contain Filtered Values, as indicated by the asterisk (*). The percentage refers to the proportion of cells in the dataset that contain Filtered Values, including all variable types, even Unused, Row Identifier, and Learning/Test.

Data

Horizontal and vertical scrolling allows you to view the entire dataset that will be imported.

Step 3 — Data Selection, Filtering, and Missing Value Processing

Context

Step 3 of the five-step Data Import Wizard deals with Data Selection, Filtering, and Missing Values Processing.

Overview of Elements

Data

This Data panel resembles the Data panel from Step 2 — Definition of Variable Types.

However, there are several important additional pieces of information available:

- For Discrete variables, it shows the frequencies of all states, including Missing Values and Filtered Values:

As you experiment with checking/unchecking, you can see how the Number of Rows in the Information panel changes.

In terms of a data query, the Filter checkbox would be the equivalent of a nominal value row filter.

Note that the number of Filtered Values does not refer to the number of excluded rows due to an unchecked Filter checkbox.

For Continuous variables, it shows the standard statistics, such as Minimum, Maximum, Mean, and Standard Deviation. Additionally, the table displays the frequencies of non-missing values, Missing Values, and Filtered Values:

Select Values

Three actions are available in this panel:

You can choose the logic for combining the Filters and Minima/Maxima assigned in the Data panel:
- OR: a row will be removed if ANY of the selected Filters or specified Minima/Maxima across all variables apply to that row.
- AND: a row will only be removed if ALL of the selected Filters and specified Minima/Maxima across all variables apply to that row.
Click the Show Selections button to review what Filters and Minima/Maxima are currently in place.
Note the syntax for Discrete variables: The variable name is followed by "in" (i.e., is an element of) followed by the included values shown as an array in square brackets.
Further logical expressions are shown as conjunctions (AND) or disjunctions (OR) in separate lines.

Clicking the Delete Selections button removes all Filters and Minima/Maxima currently in place.

Missing Values Processing

In the Missing Values Processing panel, you can specify which kind of processing to apply to variables with Missing Values, i.e., Filter, Replace, and Infer.

Filter

The Filter function allows you to remove rows from the dataset that contain Missing Values. This is equivalent to what is commonly known as casewise deletion.

You can apply the Filter individually to any variable that contains Missing Values.

Usage

Then, check the Filter checkbox in the Missing Values Processing panel.
Next, choose the logical condition to apply when you select multiple variables to be subject to the Filter.
- OR: a row will be removed if ANY of the selected variables contain a Missing Value in that row.
- AND: a row will only be removed if ALL of the selected variables contain a Missing Value in that row.
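
The two logical conditions correspond to Python's built-in `any` and `all`. The following sketch applies casewise deletion to a small made-up table. The function `filter_rows`, the column names, and the missing-value code are assumptions for illustration.

```python
def filter_rows(rows, selected, mode="OR", missing_codes=("",)):
    """Drop rows by casewise deletion on the selected columns."""
    if not selected:
        return list(rows)  # nothing selected, nothing removed
    combine = any if mode == "OR" else all
    kept = []
    for row in rows:
        flags = [row[column] in missing_codes for column in selected]
        if not combine(flags):
            kept.append(row)
    return kept


rows = [
    {"A": "1", "B": ""},   # one Missing Value
    {"A": "", "B": ""},    # two Missing Values
    {"A": "2", "B": "3"},  # complete row
]
print(len(filter_rows(rows, ["A", "B"], mode="OR")))   # prints 1
print(len(filter_rows(rows, ["A", "B"], mode="AND")))  # prints 2
```

OR is the stricter setting: it keeps only complete rows, whereas AND removes only the rows in which every selected variable is missing.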

Before applying Filter, please consider the implications discussed in Chapter 9: Missing Values Processing.

Replace By

With the Replace By function, you can specify a value for replacing the Missing Values in the selected variable.

You have several options in this regard:

You can set a specific value:
- For a Discrete variable, you can select among the values observed in the variable from a drop-down list.

Alternatively, you can choose the Modal value, i.e., the most frequently occurring value of the variable in the dataset.

For a Continuous variable, you can select to use the Mean value computed from the dataset.

As an alternative, you can specify any arbitrary value.
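
The Modal and Mean replacement options can be reproduced with Python's statistics module. The sketch below is only an illustration of the arithmetic. The column contents are made up, `None` stands for a Missing Value, and `replace_missing` is a hypothetical helper.

```python
from statistics import mean, mode


def replace_missing(values, replacement):
    """Replace every None in the list with the given replacement."""
    return [replacement if value is None else value for value in values]


smoker = ["yes", "no", None, "no"]   # Discrete variable
age = [30.0, None, 50.0, 40.0]       # Continuous variable

modal_value = mode([v for v in smoker if v is not None])
mean_value = mean([v for v in age if v is not None])

print(replace_missing(smoker, modal_value))  # ['yes', 'no', 'no', 'no']
print(replace_missing(age, mean_value))      # [30.0, 40.0, 50.0, 40.0]
```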

Infer

For practical analysis purposes, the Infer option is the most common method for Missing Values Processing.

The Methods in Detail:

Infer — Static Imputation
Infer — Dynamic Imputation
Infer — Structural EM
Infer — Entropy-Based Imputations

Information

Step 4 — Discretization and Aggregation

Context

Step 4 — Discretization and Aggregation requires you to make several more important choices before concluding the import process.
As opposed to the previous steps, which all consisted of a single screen, Step 4 provides one screen per variable type, for a total of six screens.

Overview of Screens

As you go from Step 3 to Step 4, the variable that you last selected in Step 3 remains highlighted.
Depending on the variable type, Step 4 starts with one of six possible screens, one for each variable type.
Note that for Row Identifier and Unused variables, no actions are available. Except for the Data panel, the corresponding screens are blank.

For all other variable types, we discuss all available options in detail in separate sections:

Variable Type-Specific Screens

Weights
Learning/Test
Discretization
Aggregation

Weights

Context

This screen is only available if you designated a Weight variable in Step 2 — Definition of Variable Types.

Usage

Click on that Weight variable in the Data panel, and the Normalize Weights checkbox appears as the only option on the screen.

You need to determine whether to apply Normalize Weights or not:
- If yes, the Weights will be normalized so that the total number of cases considered by BayesiaLab for machine learning is equal to the actual number of samples in the dataset.
- If no, the Weight variable will be treated as representing the actual number of observed cases. So, a weight of 10 for one observation would be treated and counted like ten instances of that same observation. As a result, the total number of cases considered by BayesiaLab would correspond to the population from which the weight was calculated.
- This example illustrates the situation for a survey consisting of 10 observations:
- If you do not normalize, BayesiaLab would consider a sample of 100 for learning purposes and presumably find spurious relationships. This "over-counting" by a factor of 10 has the same effect as reducing the Structural Coefficient to 0.1.
- If you normalize, BayesiaLab considers the correct proportions of the weighted samples but still only considers ten observations in total for learning purposes.
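
The effect of Normalize Weights can be sketched in a few lines of Python. This is only an illustration of the rescaling described above, not BayesiaLab's internal code. The function name `normalize_weights` is hypothetical, and the example assumes that each of the ten survey observations carries a weight of 10.

```python
def normalize_weights(weights):
    """Rescale weights so that they sum to the number of rows."""
    n = len(weights)
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return [w * n / total for w in weights]


# Assumed example: ten observations, each with a weight of 10.
raw = [10.0] * 10
normalized = normalize_weights(raw)

# Without normalization, the weights add up to 100 cases.
# With normalization, they add up to the 10 rows actually observed,
# while the proportions between the rows stay the same.
print(sum(raw), sum(normalized))
```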

If you have specified a Weight variable, it will be taken into account in the Discretization and Aggregation algorithms.

Learning/Test

Context

This screen is only available if you designated a Learning/Test variable in Step 2 — Definition of Variable Types.

Usage

Select the Learning/Test variable by clicking on its header or into the corresponding column.
Select BayesiaLab's learning and test labels from the drop-down lists to match the codes in your dataset.
Additionally, you can see the proportion of cases for each code in your dataset.

If you have defined a variable of the type Learning/Test, only the "learning" rows are taken into account for Discretization and Aggregation. Otherwise, information from the test rows would influence the thresholds, which would partially defeat the purpose of having a hold-out set.

Discretization

Context

BayesiaLab requires the discretization of all Continuous variables, and in this screen, you need to specify how to discretize those variables.
The Discretization process determines how a Continuous variable will be imported into BayesiaLab, i.e.,
- the number of intervals (or bins);
- the values of the thresholds which define the ranges of the intervals.
These attributes define the transformation of the underlying Continuous variable in the dataset into a discretized Continuous node in BayesiaLab.
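
As a rough illustration of how thresholds define intervals, the following Python sketch maps raw values to interval indices. The function name and the threshold values are hypothetical. It also assumes that a value equal to a threshold falls into the lower interval, in line with state names such as <=22.5 used elsewhere in this guide.

```python
from bisect import bisect_left


def discretize(value, thresholds):
    """Return the 0-based index of the interval that contains value.

    k sorted thresholds define k + 1 intervals. Assumption: a value that
    is exactly equal to a threshold belongs to the lower interval.
    """
    return bisect_left(thresholds, value)


thresholds = [150.0, 170.0]               # hypothetical thresholds (cm)
heights = [142.5, 150.0, 168.2, 181.0]    # hypothetical observations
bins = [discretize(h, thresholds) for h in heights]
print(bins)
```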

To learn more about the important distinction between Continuous and Discrete nodes, please see these topics:

Continuous Nodes
Discrete Nodes

Usage

Select one or more Continuous variables and click into one of the headers or one of the corresponding columns.
The Discretization panel appears.

Discretization Types Overview

The first item in the Discretization panel is the Discretization Type drop-down menu.
The items on this list can be grouped into Automatic Discretization versus Manual Discretization.
- The bottom item on the drop-down menu, Manual, refers to a Manual Discretization approach in which you have full control over thresholds, etc.
- The remaining eleven items all refer to different kinds of Automatic Discretization.

However, even in Manual Discretization, you can take advantage of the algorithms available with Automatic Discretization.

Discretization Types in Detail

Manual Discretization
Automatic Discretization

Manual Discretization

Context

Manual Discretization

Select Manual from the drop-down menu.
Several additional items and buttons appear on the left side, plus a Cumulative Distribution Function (CDF) is shown on the right. This CDF plot can help in selecting appropriate discretization intervals.
In the screenshot below, the variable Standing Height (cm) is selected, meaning that the CDF plot corresponds to that variable.

Click on the Density Function button, and the Probability Density Function (PDF) of the same variable appears.
Now the button reads Distribution Function, and by clicking it, you can toggle back to the CDF view.

By default, only one threshold is placed at the mean value of the corresponding variable.
This threshold appears as a horizontal line on the CDF and a vertical line on the PDF.
The CDF and PDF plots are interactive; you can add, delete, and modify thresholds.

Editing Thresholds

The following instructions apply to both plots:

To select a threshold, left-click on that threshold.
The selected threshold is highlighted in red.
The remaining thresholds on the plot remain blue.
The precise numerical value of a selected threshold is shown in the Threshold Value field to the right of the plot.
To move a threshold, click on it and hold, then move it. Release to fix its position.
The percentages displayed at the end of a selected threshold refer to the share of observations that fall into the intervals above and below this threshold.
Instead of moving the selected threshold with your cursor, you can type a specific value into the Threshold Value field.
To add an additional threshold, right-click with your cursor on the desired position.
To remove an existing threshold, right-click on it.
A zoom function is available for examining the plot in detail:
- Hold the Ctrl key, click and hold the left mouse button, then move the cursor across the range you wish to focus on.
- You can zoom in repeatedly until you have reached the desired magnification level.
- To revert to the default zoom, hold Ctrl, then double-click anywhere in the plot area.
As an alternative to selecting a threshold by left-clicking, you can scroll through all thresholds using the Previous and Next buttons.
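
The percentages shown for a selected threshold can be reproduced with a short Python sketch. It covers the simplest case of a single threshold, and the function name and data are hypothetical.

```python
def shares_around_threshold(values, threshold):
    """Return (share at or below, share above) the threshold, in percent."""
    if not values:
        raise ValueError("values must not be empty")
    below = sum(1 for v in values if v <= threshold)
    pct_below = 100.0 * below / len(values)
    return pct_below, 100.0 - pct_below


heights = [150, 155, 160, 165, 170, 175, 180, 185]   # hypothetical data
print(shares_around_threshold(heights, 167.5))        # 4 of 8 values lie below
```

The share at or below the threshold is the value of the empirical CDF at that point, which is why the threshold appears as a horizontal line on the CDF plot.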

Note that as soon as a threshold is defined on a Continuous variable, it is considered Discretized, and the variable's data column is colored in soft blue.

The interactive CDF and PDF plots are similar to the editing functions available under Curve View in the Node Editor.

Workflow Illustration

We re-use the dataset from the previous steps, so we can fast-forward to Step 4 and focus on that step.

Generate a Discretization

While remaining on the Manual Discretization screen, you can also utilize the Generate a Discretization function.

Click on the Generate a Discretization button.
Then, select the Type from the drop-down menu, e.g., the R2-GenOpt algorithm. You have nine algorithms available, i.e., the univariate methods only.

Choose the number of Intervals, e.g., 5.
Set a Minimum Interval Weight, which defines the minimum prior probability of an interval in percent. The default value is 1%.
Note that you can set defaults for the above settings under Main Menu > Window > Preferences > Discretization.

Additionally, there are options for Log Transformation and Isolate Zeros, which we discuss in the context of Automatic Discretization.
Click OK to perform the Discretization.
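
To make the roles of Intervals and Minimum Interval Weight more concrete, here is a minimal Python sketch. It uses a plain equal-frequency scheme as a stand-in, not R2-GenOpt or any other BayesiaLab algorithm, and all function names are hypothetical.

```python
from bisect import bisect_left


def equal_frequency_thresholds(values, intervals):
    """Place thresholds so that each interval holds about the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    if n < intervals:
        raise ValueError("need at least as many values as intervals")
    cuts = []
    for k in range(1, intervals):
        i = k * n // intervals
        # Put the threshold halfway between two neighboring observations.
        cut = (ordered[i - 1] + ordered[i]) / 2
        if not cuts or cut > cuts[-1]:   # skip duplicates caused by ties
            cuts.append(cut)
    return cuts


def interval_weights(values, thresholds):
    """Share of observations, in percent, that falls into each interval."""
    counts = [0] * (len(thresholds) + 1)
    for v in values:
        counts[bisect_left(thresholds, v)] += 1
    return [100.0 * c / len(values) for c in counts]


def satisfies_minimum_weight(values, thresholds, min_weight_pct=1.0):
    """Check that every interval reaches the minimum weight (default 1%)."""
    return all(w >= min_weight_pct for w in interval_weights(values, thresholds))


data = list(range(1, 101))                     # 100 hypothetical observations
cuts = equal_frequency_thresholds(data, 5)
print(cuts, interval_weights(data, cuts))
```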

Workflow Illustration

Transfer the Discretization Thresholds

Select the source variable from which you wish to copy the thresholds.
Click the Transfer the Discretization Thresholds button.
A new window opens up that allows you to select one or more target variables.
Select the target variables.
Click OK.

Workflow Animation

Create a Class for Each Type of Discretization

This checkbox is synchronized across Manual and Automatic Discretization processes.
If checked, BayesiaLab automatically creates Classes for each type of Discretization, i.e., all variables that are discretized with the same algorithm will belong to the same Class.
Note that variables that were discretized manually, even if you used the Generate a Discretization button, will all become members of the Class MANUAL.
You can review the Class memberships in the Class Editor after the data import process is complete.

Load Discretizations

This function allows you to load a Discretization Dictionary with saved Discretization Intervals and Discretization Methods.
This approach is particularly helpful when you repeatedly import datasets with the same variables for which you have already found a suitable discretization.

The following text file illustrates the syntax of a Discretization Dictionary.

Automatic Discretization

Context

Except for Manual, all items in the Type menu represent Automatic Discretization algorithms.

Usage

Selecting a Discretization algorithm applies variable by variable, i.e., you can use a different algorithm for each Continuous variable.
To select a variable, click on the variable header or anywhere inside the column.
You can perform the selection and deselection of multiple variables with keystroke combinations commonly used in spreadsheet editing:
- Ctrl+Click: add a variable to the current selection.
- Shift+Click: add all variables between the currently selected and the clicked variable to the selection.
- Ctrl+A: select all variables in the Data panel. However, selecting all variables is not useful here in Step 4, as there are no actions that can apply to all variable types.
- Shift+End: select all variables from the currently selected variable to the rightmost variable in the table.
- Shift+Home: select all variables from the currently selected variable to the leftmost variable in the table.
Click the Select All Continuous button to select all Continuous variables.
- Note that this action will also select any variables which you have already discretized manually. As a result, you may override your previous choices.
- Note that Continuous variables already discretized manually are highlighted in soft blue.

If you do not specify an algorithm for a variable that was not manually discretized either, the default Discretization algorithm with its default settings will be used.
You can set the default Discretization algorithm under Main Menu > Window > Preferences > Discretization.
For the following algorithms, a Log Transformation is available as an option:
- Applying the Log Transformation is useful if you have a high density of values at the bottom end of the variable domain. This "stretches" the scale for small values approaching zero.
- Note that the Log Transformation is only used temporarily for discretization purposes. Thus, the values of the thresholds and values of the intervals can all be interpreted based on the original scale.
For the following algorithms, the option Isolate Zeros is available:
- Separating 0 into a separate interval can be useful for zero-inflated distributions so as to clearly separate small values from "absolutely nothing."
Click Finish to perform the Discretization.
A progress bar displays the status of the Discretization process.

If a Filtered Value is defined for a Continuous variable, a new artificial interval with an infinitesimally small width of 10⁻⁷ will be added after the intervals defined in this step. This newly-created state will serve as the Filtered State, and "*", i.e., the asterisk character, will be its State Name.
At its conclusion, BayesiaLab opens up a Graph Window with all imported variables now represented as nodes.
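
The ideas behind the Log Transformation and Isolate Zeros options can be sketched as follows. The example uses simple equal-width binning as a stand-in for whichever algorithm is selected, assumes strictly positive values for the log step, and uses hypothetical function names throughout.

```python
import math


def equal_width_thresholds(values, intervals):
    """Thresholds that cut the range of values into equally wide intervals."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / intervals
    return [lo + k * step for k in range(1, intervals)]


def log_equal_width_thresholds(values, intervals):
    """Bin on log(values), then report thresholds on the original scale.

    Assumption: all values are strictly positive.
    """
    logs = [math.log(v) for v in values]
    return [math.exp(t) for t in equal_width_thresholds(logs, intervals)]


def split_zeros(values):
    """Separate exact zeros from the remaining values (the Isolate Zeros idea)."""
    zeros = [v for v in values if v == 0]
    rest = [v for v in values if v != 0]
    return zeros, rest


data = [0, 0, 1, 10, 100, 1000]                  # hypothetical, dense near zero
zeros, positive = split_zeros(data)
plain = equal_width_thresholds(positive, 3)       # both cuts land far from the small values
logged = log_equal_width_thresholds(positive, 3)  # cuts spread across the small values
print(plain, logged)
```

Because the thresholds are mapped back with `math.exp`, they can be read on the original scale, as noted above.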

Automatic Discretization Algorithms in Detail

Export Dictionary

Data

Open Data Source

This menu item allows opening the file or the database selector and then starts the Data Import Wizard.

Text file: Once the file is read and the pre-processing is done, a fully unconnected network is created in a new graph window, with one node per attribute. The set of Bayesian network learning methods then becomes available.
Database: Once the database table is loaded and the pre-processing is done, a fully unconnected network is created in a new graph window, with one node per attribute. The set of Bayesian network learning methods then becomes available.
Recent databases: Keeps a list of the recently opened databases. The Data Import Wizard opens directly on the selected file. The size of this list can be modified through the Menus settings.

Associate Data Source

This menu item allows opening the Data association wizard in order to associate data from a text file or a database with an existing Bayesian network.

Recent databases: Keeps a list of the recently opened databases. The Data association wizard opens directly on the selected file. The size of this list can be modified through the Menus settings.

Associate Dictionary

This menu item allows defining the properties of the active Bayesian network by means of text files. These properties concern arcs, nodes, and states:

Arc:
- Arcs: allows associating a set of arcs with the network. The indicated arcs can be added to or removed from the network. Arc removals are always performed before arc additions. Before an arc is added, all the constraints belonging to the Bayesian network, as well as the arc constraints and the temporal indices, are checked. If a constraint is violated, the arc is not added.
- Forbidden Arcs: allows associating a set of forbidden arcs with the network.
- Arc Comments: allows associating a set of arc comments with the network.
- Arc Colors: allows associating a set of arc colors with the network.
- Fixed Arcs: allows defining whether some arcs are fixed or not.
Node:
- Node Renaming: allows renaming each node with a new name. The new names must all be distinct.
- Comments: allows associating a comment with each node that is in the file.
- Classes: allows organizing nodes in subsets called classes. A node can belong to several classes at the same time. Classes make it possible to generalize node properties to all nodes belonging to the same class. They can also be used to create constraints on arc creation during learning.
- Colors: allows associating colors with the nodes or classes that are in the file. The colors are written as Red Green Blue with 8 bits per channel in hexadecimal format (web format). For example, the color red is 255 red, 0 green, 0 blue, which gives FF0000. Green gives 00FF00, yellow gives FFFF00, etc.
- Images: allows associating images with the nodes or classes that are in the file. The images are specified by their path relative to the directory that contains the dictionary.
- Costs: allows associating a cost with each node. A node without a cost is called not observable.
- Temporal Indices: allows associating temporal indices with the nodes that are in the file. BayesiaLab's learning algorithms use these temporal indices to enforce constraints on the probabilistic relations, for example, to prevent adding arcs from future nodes to past nodes. The rule for adding an arc from node N1 to node N2 is:
- If the temporal index of N1 is positive or zero, then the arc from N1 to N2 is only possible if the temporal index of N2 is greater than or equal to the index of N1.
- Local Structural Coefficients: allows setting the local structural coefficient of each specified node or each node of each specified class.
- State Virtual Numbers: allows setting the state virtual number of each specified node or each node of each specified class.
- Locations: allows setting the position of each node.
State:
- State Renaming: allows renaming each state of each node with a new name.
- State Values: allows associating a numerical value with each state of each node.
- State Long Names: allows associating with each state of each node a long name that is more explicit than the default state name. This name can be used in the different ways of exporting a database, in the HTML reports, and in the monitors.
- Filtered States: allows defining one state of each node as a filtered state.
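
Two of the conventions in the list above lend themselves to a short illustration: the hexadecimal color format and the temporal index rule. The following Python sketch uses hypothetical function names. Since the rule above only covers a positive or zero index for N1, the sketch assumes that a negative index for N1 places no restriction on the arc.

```python
def rgb_to_hex(red, green, blue):
    """Encode a color as RRGGBB with 8 bits per channel (web format)."""
    for channel in (red, green, blue):
        if not 0 <= channel <= 255:
            raise ValueError("each channel must be in the range 0..255")
    return f"{red:02X}{green:02X}{blue:02X}"


def arc_allowed(index_n1, index_n2):
    """Apply the temporal index rule to a candidate arc N1 -> N2."""
    if index_n1 >= 0:
        return index_n2 >= index_n1
    # Assumption: the rule does not constrain arcs from a negative index.
    return True


print(rgb_to_hex(255, 0, 0))     # red
print(rgb_to_hex(255, 255, 0))   # yellow
print(arc_allowed(1, 2), arc_allowed(2, 1))
```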

Dictionary File Structures

In order to differentiate a name that is shared by a class, a node, or a state, you must add a suffix at the end of the name: "c" for a class, "n" for a node, and "s" for a state.

Export Dictionary

This menu item allows exporting the different kinds of dictionaries in text files.

Associate an Evidence Scenario File

This menu item allows associating an evidence scenario file with the network.

Export an Evidence Scenario File

This menu item allows exporting into a text file an evidence scenario file associated with the network.

Save Data

Imputation

Allows the imputation of the missing values of the associated database according to the mode selected in the following dialog box:

Graphs

Opens the graph editor if a database is associated with the current network.

Step 1 — Data Structure Definition: Text File

Context

Open Data Source (Data Import Wizard) brings data into BayesiaLab to create a new Bayesian network.

BayesiaLab can load data from flat text files (e.g., CSV, TXT) or connected databases.

Usage

In Step 1 — Data Structure Definition: Text File of the five-step Data Import Wizard, you need to define the dataset structure for BayesiaLab so that the data can be imported and interpreted correctly.
In Modeling Mode, select Main Menu > Data > Open Data Source > Text File.
The Data Structure Definition window opens up.
Specify all Settings & Options (see below).
Click Next to proceed to Step 2 — Definition of Variable Types.

Elements in Step 1

Separators
Encoding
Missing Values
Filtered Values
Sampling
Learning/Test
Options

Separators

The Data Import Wizard will attempt to automatically identify the separator or delimiter of the fields in the data table.

However, there can be ambiguous situations in which you need to specify the separator by checking the appropriate box:

Tab
Semicolon
Comma
Space
Other

Encoding

The Encoding drop-down list allows you to select an alternative encoding for the dataset to be imported. This can become necessary for importing data from certain legacy systems.

Missing Values

Specifying the correct code for Missing Values is very important so that BayesiaLab can process such Missing Values appropriately.

Click Add to create a new entry in this list for the current data import.
Clicking Remove deletes the selected entries.

You can set your own default list of codes under Main Menu > Windows > Preferences > Data > Import & Associate > Missing & Filtered Values.

Filtered Values

Just as important as the correct definition of Missing Values is a clear understanding of a Filtered Value.

Click Add to create a new entry in this list for the current data import.
Clicking Remove deletes the selected entries.

You can set your own default list of codes under Main Menu > Windows > Preferences > Data > Import & Associate > Missing & Filtered Values.

In Data Preview, all Filtered Values are marked with an asterisk (*) in the data table.

Understanding the difference between Missing and Filtered Values is critically important.

Sampling

Clicking the Define Sample button opens a window that allows you to sample records from your data source.

This is particularly useful for the preliminary analysis of large datasets. By default, BayesiaLab imports all records from the data.

You can define a subset in three ways:

Random Sample — Size in Percent: specify the size of the random sample as a percentage of the original dataset size.
Random Sample — Size: specify the number of records in the sampled dataset.
Custom Range — First Row to Last Row: specify the range of records to be imported.
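
The three ways of defining a subset can be mimicked outside of BayesiaLab with a few lines of Python. This is a sketch only: the function names are hypothetical, and the row numbering in `custom_range` is assumed to be 1-based and inclusive.

```python
import random


def sample_size(rows, size, seed=None):
    """Random sample with a fixed number of records, kept in original order."""
    rng = random.Random(seed)
    picked = sorted(rng.sample(range(len(rows)), size))
    return [rows[i] for i in picked]


def sample_percent(rows, percent, seed=None):
    """Random sample whose size is a percentage of the dataset size."""
    return sample_size(rows, round(len(rows) * percent / 100), seed)


def custom_range(rows, first_row, last_row):
    """Rows first_row..last_row (assumed 1-based and inclusive)."""
    return rows[first_row - 1:last_row]


rows = list(range(100))                     # stand-in for 100 records
print(len(sample_percent(rows, 10, seed=1)))
print(len(sample_size(rows, 25, seed=1)))
print(custom_range(rows, 1, 5))
```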

Learning/Test

By default, the Data Import Wizard loads the entire dataset as a Learning Set.

By clicking the Define Learning/Test Sets button, you can set aside a Test Set (or holdout sample).

You can define the Learning Set/Test Set split in three ways:

Random Test Set — Size in Percent: specify the size of the Test Set as a percentage of the original dataset size.
Random Test Set — Size: specify the number of records in the Test Set.
Custom Test Set — First Row to Last Row: select a specific range of records for a Test Set.

In addition to specifying a Learning Set/Test Set split here, you can define a split in other ways:

You can designate a variable in the original dataset to assign records to the Learning Set and Test Set. You can select such a variable in the next step of the Data Import Wizard: Step 2 — Definition of Variable Types.
Main Menu > Data > Data Set > Generate Learning/Test Split
Right-click on the database icon in the Status Bar and select Generate Learning/Test Split.

Furthermore, you can remove the Learning Set/Test Set split at any time:

Main Menu > Data > Data Set > Remove Learning/Test Split.
Right-click on the database icon in the Status Bar and select Remove Learning/Test Split.

Options

The Options Panel allows you to manage the interpretation of the to-be-imported dataset.

Title Line:
- By checking this option, BayesiaLab reads the first row of the dataset and uses its values as column headers.
- If the values in the first row are not compatible, e.g., due to missing values or duplicate values, you are prompted to accept the proposed corrections, which include adding suffixes for duplicate names and substituting missing values with generic column headers, e.g., N0, N1, N2, etc.
End of Line Character:
- With some files, it may be necessary to specify a certain character so that BayesiaLab can correctly detect the end of a row in a data table.
Consider Identical Consecutive Separators as One:
- Check this box so that if you have multiple consecutive separators of the same type, e.g., “;;;”, the Data Import Wizard will treat them as a single separator.
Consider Different Consecutive Separators as One:
- Check this box so that if you have multiple consecutive separators of any type, e.g., “;,|”, the Data Import Wizard will treat them as a single separator.
Double Quotes:
- Remove
- As String Delimiters
Simple Quotes:
- Remove
- As String Delimiters
Transpose:
- By default, BayesiaLab expects the data source to be arranged in
  - columns corresponding to variables and
  - rows corresponding to samples, records, or observations.
- Checking the Transpose option allows you to accept an alternate format, i.e.,
  - rows corresponding to variables and
  - columns corresponding to samples, records, or observations.
- The transposed format is commonly used in bioinformatics. For instance, variables representing genes — sometimes tens of thousands — are arranged row by row. Observations — sometimes only a few dozen — are placed in columns side by side.
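
The two "Consecutive Separators" options and the Transpose option can be illustrated with a small Python sketch. It only mirrors the behavior described above with regular expressions. The function names and flags are hypothetical and do not correspond to BayesiaLab settings.

```python
import re


def split_fields(line, separators, identical_as_one=False, different_as_one=False):
    """Split a line into fields according to the separator options."""
    escaped = [re.escape(s) for s in separators]
    if different_as_one:
        # Any run of separators, of any type, counts as one separator.
        pattern = "(?:" + "|".join(escaped) + ")+"
    elif identical_as_one:
        # Only runs of the same separator are collapsed.
        pattern = "|".join(f"(?:{e})+" for e in escaped)
    else:
        pattern = "|".join(escaped)
    return re.split(pattern, line)


def transpose(table):
    """Swap rows and columns of a rectangular table."""
    return [list(column) for column in zip(*table)]


print(split_fields("a;;;b", [";"]))
print(split_fields("a;;;b", [";"], identical_as_one=True))
print(split_fields("a;,|b", [";", ",", "|"], different_as_one=True))
print(transpose([["gene1", 1, 2], ["gene2", 3, 4]]))
```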

The data table at the bottom of the window provides a preview of how the Data Import Wizard sees and interprets your dataset.

Blank fields indicate a Missing Value.
Asterisks (*) mark Filtered Values. In the dataset shown below, for instance, Filtered Values were assigned to all males and post-menopausal women for the variable Pregnancy Status. For those two groups and for obvious reasons, pregnancy is impossible.
Horizontal and vertical sliders allow you to scroll and view the entire dataset. Alternatively, you can move your mouse's scroll wheel up and down.
If a variable name exceeds the column width, you can click on the divider between column headers and drag it into the desired position. Alternatively, double-click the divider to auto-fit the column width to the variable name.

Workflow Animation

In the following animation, we show a dataset that requires numerous settings to be adjusted for proper import:

The dataset uses the pipe character ("|") as a delimiter.
All fields are enclosed in double quotes.
Multiple, arbitrary codes are used for Missing Values:
- "Refused"
- "unknown"
"Not Applicable" is the code for Filtered Value used in this dataset.

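For readers who want to check such a file before importing it, the settings listed above can be approximated with Python's standard `csv` module. The sample rows below are invented for illustration; only the delimiter, the quoting, and the three codes come from the description above.

```python
import csv
import io

MISSING_CODES = {"Refused", "unknown"}   # codes for Missing Values
FILTERED_CODE = "Not Applicable"         # code for Filtered Values


def read_table(text):
    """Parse pipe-delimited, double-quoted text and mark special codes."""
    reader = csv.reader(io.StringIO(text), delimiter="|", quotechar='"')
    header = next(reader)
    rows = []
    for record in reader:
        cleaned = []
        for field in record:
            if field in MISSING_CODES:
                cleaned.append(None)     # missing value
            elif field == FILTERED_CODE:
                cleaned.append("*")      # filtered value, shown as an asterisk
            else:
                cleaned.append(field)
        rows.append(cleaned)
    return header, rows


# Hypothetical sample data.
sample = (
    '"Age"|"Gender"|"Pregnancy Status"\n'
    '"34"|"Female"|"No"\n'
    '"Refused"|"Male"|"Not Applicable"\n'
    '"51"|"unknown"|"No"\n'
)
header, rows = read_table(sample)
print(header)
print(rows)
```
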
Aggregation of Single Variable

Individual variables can be aggregated manually or automatically in Step 4 of the Data Import Wizard.

To illustrate all related workflows, we use an American auto buyer satisfaction survey containing 42,397 responses. Each record contains attributes of the purchased vehicle, such as make (or brand), model, body style, vehicle segment, number of cylinders, transmission, price paid, self-reported fuel economy, plus hundreds of other variables.

Manual Aggregation

First, we want to manually aggregate all 37 automobile brands that appear in the survey into just two states, i.e., Premium Brands and Non-Premium Brands.

This manual aggregation will be based exclusively on our subjective perception of the auto industry as of 2009, which is when this particular survey was conducted.

Click on the Brand variable in the Data panel.
From the States list on the left, select the values you wish to aggregate using Shift+Click or Ctrl+Click.

Then, click the Aggregate button.
The newly-formed, aggregated state appears in the Aggregates list on the right.

By default, the original values are concatenated using the "+" symbol as a delimiter. An underscore "_" is added as a prefix.
As necessary, you can select more values from the States list and create additional aggregated states.
In the list of Aggregates, you can now replace the automatically-generated state names with more meaningful ones.
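
The default naming convention for aggregated states is simple enough to express in one line of Python. The function name and the brand names below are only illustrative.

```python
def default_aggregate_name(states):
    """Join the original state names with "+" and add "_" as a prefix."""
    return "_" + "+".join(states)


selected = ["Audi", "BMW", "Lexus"]          # illustrative selection
print(default_aggregate_name(selected))      # _Audi+BMW+Lexus
```

Names generated this way grow quickly with the number of selected values, which is one reason to replace them with shorter, more meaningful names.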

You can now proceed to any other variable or click Finish to conclude the Data Import Wizard.

Workflow Animation

Correlation-Aided Manual Aggregation

In addition to the Manual Aggregation described above, BayesiaLab can support you in making the aggregation decisions. For this purpose, BayesiaLab can show how the original values of the to-be-aggregated variable correlate with those of other variables.

Continuing with the previous example, we now perform an aggregation of the same variable, Brand. Now, however, we use each brand's correlation with Price as a guide instead of our judgment.

For the purpose of this demonstration, we have already discretized the Price variable manually into three (arbitrary) intervals using two thresholds, i.e., $25,000 and $45,000.

We now want to use the correlation of each brand with the top interval, i.e., $45,000+, as a measure of its "premium appeal" so that we can reduce the 37 brands into three states, Mainstream, Premium, and Luxury.

For reference, 8.65% of all survey responses reported a vehicle purchase price of $45,000 or higher.

Workflow Instructions

Click on the Brand variable in the Data panel.
Click the Show Correlations box.
Select Target and State.

Review the values shown in the Correlations column. By hovering with your cursor over the Correlation bars in each row, a Tooltip displays the percentage difference of the corresponding row versus the marginal value.
The colored bars show how each value compares to the marginal probability of the selected state of the target. A green-colored bar indicates a probability higher than the marginal probability, and a red bar suggests a lower probability.

Select the states to aggregate using Ctrl+Click.

Once you have selected the values, click the Aggregate button.
The newly aggregated values now appear as a single item in the Aggregates list.

Review the newly aggregated states and, if necessary, assign new names to replace the ones that were generated automatically.
To reverse the aggregation, select the aggregated items in the Aggregates list and click Delete.

Workflow Animation

Correlation-Aided Automatic Aggregation

The Correlation-Aided Automatic Aggregation is very similar to the Correlation-Aided Manual Aggregation.

The principal difference is that you don't select your to-be-aggregated values manually but rather specify thresholds that determine the aggregation.

So, the initial steps are analogous to the Correlation-Aided Manual Aggregation.

Click on a Discrete variable in the Data panel.
Click the Show Correlations box.

Select Target and State.
Review the values shown in the Correlations column. By hovering with your cursor over the Correlation bars in each row, a Tooltip displays the percentage difference of the corresponding row versus the marginal value.
The colored bars show how each value compares to the marginal probability of the selected state of the target. A green-colored bar indicates a probability higher than the marginal probability, and a red bar suggests a lower probability.

Now, instead of manually selecting the values you want to aggregate, click the Automatic Aggregation button.
The Automatic Aggregation window opens up.

The colored bar at the top visualizes the percentage differences versus the marginal probability of the selected state of the target.
In our example, there is one brand, Mercury, which had no observations in the $45,000+ interval. As a result, it marks the bottom end of the spectrum, i.e., it is 8.65 percentage points below the marginal probability.
On the other end of the spectrum, Porsche is 83.97 percentage points above the marginal probability.
A default threshold is shown for 0, which is marked by the pink-to-red color change in the bar.
You can manually add thresholds by right-clicking on the bar.
As soon as you add a threshold, a corresponding entry appears in the list below.

Right-clicking again on an existing threshold removes that threshold.
You can move an existing threshold by clicking on it and then dragging it to the desired value.

Also, in the table below the colored bar, you can type in a threshold value.

By clicking OK, you confirm the specified thresholds, and all values in the States list will be aggregated accordingly.
Alternatively, you can click on Generate Aggregates and specify the desired number of intervals.
You obtain a set of aggregation thresholds, which you can further modify or accept by clicking OK.
Now you have a new set of states in the list of Aggregates.
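
The logic of the threshold-based aggregation can be sketched in Python under a few assumptions. The sketch computes, for each value, the difference in percentage points between the conditional probability of the target state and its marginal probability, and then groups the values by the thresholds. The function names, the sample records, and the handling of values that fall exactly on a threshold are assumptions, not BayesiaLab's implementation.

```python
from bisect import bisect_right
from collections import defaultdict


def differences_vs_marginal(records, target_state):
    """records: (value, target) pairs.

    Returns {value: P(target_state | value) - P(target_state)},
    expressed in percentage points.
    """
    total = hits = 0
    per_value = defaultdict(lambda: [0, 0])      # value -> [hits, count]
    for value, target in records:
        hit = target == target_state
        total += 1
        hits += hit
        per_value[value][0] += hit
        per_value[value][1] += 1
    marginal = 100.0 * hits / total
    return {v: 100.0 * h / c - marginal for v, (h, c) in per_value.items()}


def aggregate_by_thresholds(differences, thresholds):
    """Group values by the interval into which their difference falls."""
    cuts = sorted(thresholds)
    groups = defaultdict(list)
    for value, diff in differences.items():
        groups[bisect_right(cuts, diff)].append(value)
    return dict(groups)


# Hypothetical records: (brand, price interval).
records = (
    [("A", "<$45,000")] * 4
    + [("B", "$45,000+")] + [("B", "<$45,000")] * 3
    + [("C", "$45,000+")] * 2
)
diffs = differences_vs_marginal(records, "$45,000+")
print(diffs)
print(aggregate_by_thresholds(diffs, [-10.0, 0.0]))
```

A value with no observations in the target state ends up exactly at minus the marginal probability, which matches the Mercury example above.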

Workflow Animation

Aggregation of Multiple Variables

Context

We use the same auto buyer survey dataset to illustrate the process. In the auto industry, numerous schemes are used to group vehicle types and body styles into so-called segments. Each segment carries a descriptive name, e.g., Compact Car, Full-Size SUV, Minivan, Mid-Size Pickup, Mid-Size Crossover. In our dataset, we have four variables, which each represent such a segmentation scheme. While all these segmentation schemes roughly convey the same information, they differ in their granularity: for instance, variable Segmentation 3 has 23 states; Segmentation 4 has 33. Our objective is now to reduce each one of the segmentation schemes down to three states.

This time, instead of Price, we use the variable MPG - Combined as a target. It represents the survey respondents' estimates of their vehicles' combined fuel economy in miles per gallon (MPG). In other words, we want to create a new aggregation for each segmentation scheme based on fuel economy. Also, the variable MPG - Combined only has two intervals, with one threshold at 22.5. This number has been used in the past as a criterion for so-called "gas guzzlers." So, we are going to use the state <=22.5 as a proxy for poor fuel economy. As a result, we expect each of the existing segments to be "remapped" according to fuel economy.

Workflow

In the Data panel, using Ctrl+Click or Shift+Click, select the variables Segmentation 1, Segmentation 2, Segmentation 3, and Segmentation 4.
This brings up the Multiple Aggregation panel.

Set Target to MPG - Combined, and State to <=22.5.
Set Final Number of States to 3.
Click the Aggregate button to perform the aggregation.
Note that there will be no immediate feedback regarding the results of the aggregation.
Rather, we can only see the results of the aggregation in the Import Report in Step 5 of the Data Import Wizard.
Click Finish to complete Step 4 of the Data Import Wizard.
BayesiaLab opens a new Graph Window with all variables now presented as nodes.
Simultaneously, a prompt comes up offering to display the Import Report.

Click Yes, and the Import Report — featuring all variables, not just the aggregated variables — appears in a new window.

Associate Data Source (Data Association Wizard)

Context

BayesiaLab can load data from flat text files (e.g., CSV, TXT) or connected databases.

Usage

To launch the Data Association Wizard for a data table in a
- text file, select Main Menu > Data > Associate Data Source > Text File.
- database, select Main Menu > Data > Associate Data Source > Database.

Workflow

Step 1 — Data Structure Definition

See Step 1 of the Data Import Wizard.

Step 2 — Definition of Variable Types

See Step 2 of the Data Import Wizard.

Additionally, clicking the Unmatched Columns button displays all the columns in the database that are not in the network.

The Unmatched Columns window allows you to choose whether or not to use the unmatched columns from the new dataset.

Step 3 — Data Selection, Filtering, and Missing Value Processing

Step 4 — Node and Node State Association

This step links the variables in the dataset to the nodes of the network.
As such, this step depends on the three previous steps and the selection of variable types.
Here you can define how the variables in the to-be-associated dataset will be mapped to the nodes already in the network.
The following assignments are possible:
- Discrete variable in the dataset → Discrete node in the network
- Discrete variable in the dataset → Continuous node in the network
- Continuous variable in the dataset → Continuous node in the network
If variables in the dataset have the same name and type as existing nodes in the network, BayesiaLab will automatically propose an association.
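
The association rules above can be sketched in a few lines of Python. This is only an illustration, not BayesiaLab's implementation: the function names, the type labels "discrete" and "continuous", and the dictionaries standing in for the dataset and the network are all hypothetical.

```python
# Hypothetical sketch of the association rules in Step 4; not BayesiaLab code.

# Allowed (dataset variable type, network node type) pairs, as listed above.
ALLOWED = {
    ("discrete", "discrete"),
    ("discrete", "continuous"),
    ("continuous", "continuous"),
}


def is_allowed(variable_type, node_type):
    """Return True if a variable of this type may be mapped to such a node."""
    return (variable_type, node_type) in ALLOWED


def propose_associations(dataset_variables, network_nodes):
    """List the variables whose name and type both match an existing node.

    Both arguments map a name to "discrete" or "continuous".
    """
    return sorted(
        name
        for name, variable_type in dataset_variables.items()
        if network_nodes.get(name) == variable_type
    )


dataset = {"Age": "continuous", "Gender": "discrete", "Income": "discrete"}
network = {"Age": "continuous", "Gender": "discrete", "Income": "continuous"}
print(propose_associations(dataset, network))
```

In this made-up example, Income is not proposed automatically because its type differs between the dataset and the network, although a discrete-to-continuous association remains permitted.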

Step 5 — Discretization and Aggregation

Step 6 — Data Association Report

Workflow Illustration

You can proceed in the same way for the continuous node N. You can also select and add several nodes at the same time.

Zone 3 contains the buttons used to add or remove associations.

Zone 4 contains the list of associations. It can also contain variables added from the database, which will be treated as new nodes in the network. Double-clicking an association displays, if necessary, a dialog for editing a discrete or continuous association. Some associations may show a warning icon. This icon indicates unusual behavior in those associations.

Zone 6 contains three buttons. The first and second buttons automatically extend the minimum and maximum of each continuous node that does not fit the database's limits. The third button automatically filters out each row that does not fit the network's limits.

Discrete Column Association

When you want to add or edit an association between a discrete column of the database and a discrete or continuous node, a dialog box appears:

Zone 3 contains the buttons to add or remove state associations.

By default, database states that match the network's state names, aggregates, or state long names are linked automatically.

If filtered values exist in the database but are not declared in the network, you can merge them with the special state *, if it exists. In this case, this state is automatically defined as filtered for each node concerned.

Continuous Column Association

When you want to add or edit an association between a continuous column of the database and a continuous node, a dialog box appears:

This dialog is displayed only if the limits of the variable from the database are outside the limits of the node from the network.

By default, the limits of the node of the network are used and all the values outside these limits will be removed from the database. If you want to keep them, use the corresponding options.
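
The difference between the default behavior and the option to keep out-of-range values can be sketched as follows. This is a hypothetical Python illustration with made-up function names and values, not BayesiaLab's implementation.

```python
# Hypothetical sketch of the two ways to handle out-of-range values; not BayesiaLab code.

def filter_to_node_limits(values, node_min, node_max):
    """Default behavior: drop values outside the node's [min, max] limits."""
    return [v for v in values if node_min <= v <= node_max]


def extend_node_limits(values, node_min, node_max):
    """Alternative: widen the node's limits so that every value is kept."""
    if not values:
        return node_min, node_max
    return min(node_min, min(values)), max(node_max, max(values))


column = [12.0, 18.5, 22.5, 31.0, 47.2]
print(filter_to_node_limits(column, 15.0, 40.0))  # [18.5, 22.5, 31.0]
print(extend_node_limits(column, 15.0, 40.0))     # (12.0, 47.2)
```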

Step 5: Discretization of the Continuous Variables and State Aggregation of the Discrete Variables

This step occurs only when some columns of the database are not linked with nodes of the network but are distributed. These columns create new nodes in the network. Continuous columns must be discretized, and the states of discrete columns can be aggregated.

This step is the same as Step 4 of the Data Import Wizard.

Step 6: Associate Report

The modified nodes table:
- For discrete nodes, it indicates, if necessary, the correspondence between the states in the database and those in the network.
- For continuous nodes, it indicates, if necessary, the initial minimum of the data and the retained final minimum, as well as the initial maximum and the retained final maximum.
The hidden nodes table: indicates the nodes that are in the network but don't have any associated data.
The added nodes table: indicates the list of variables added to the network from the database. This table is the same as in the Import Report.

Associate Dictionary

Context

Dictionaries offer a convenient way to manage a large set of properties related to a Bayesian network using text files with a human-readable syntax.

Dictionaries are plain text files that can be opened and edited outside of BayesiaLab in any text editor.

Using Dictionaries, you can export the properties of a given network or associate properties that you previously saved.

Dictionaries are specific to the elements of a Bayesian network, e.g., Arcs, Nodes, and States and their respective properties.

Usage

To associate a Dictionary for Arc properties, select

Main Menu > Data > Associate Dictionary > Arc >

and then select the property from the submenu:

Dictionary File Structure

To distinguish a name that is shared by a class, a node, or a state, you must append a suffix to the name: "c" for a class, "n" for a node, and "s" for a state.

If your network contains non-ASCII characters, you must save your dictionaries with UTF-8 (Unicode) encoding. For example, in MS Excel, choose "Save As" and select "Text Unicode (*.txt)" as the file type. In Notepad, choose "Save As" and select "UTF-8" as the encoding. If your file contains only ASCII characters, you can keep the default encoding (which depends on the platform), but using UTF-8 is strongly encouraged so that dictionary files don't depend on the user's platform. For example, a Chinese dictionary can then be read by a German user without any problem, regardless of the platforms involved. If you are not sure how to save a file with UTF-8 encoding, export a dictionary from BayesiaLab, modify and save it with any text editor, and then load it in BayesiaLab.
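
If you generate dictionaries with a script rather than a text editor, the same advice applies: set the encoding explicitly. The following Python sketch shows one way to do this. The file name and node names are invented for illustration.

```python
from pathlib import Path

# Hypothetical file name and node names, for illustration only.
dictionary_path = Path("arcs_dictionary.txt")
lines = ["Température->Défaut=true", "N1->N2=false"]

# Passing encoding="utf-8" explicitly avoids depending on the platform default.
dictionary_path.write_text("\n".join(lines) + "\n", encoding="utf-8")

# Reading the file back with the same encoding returns the original text.
print(dictionary_path.read_text(encoding="utf-8"))
```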

Arc

To associate a Dictionary for Arc properties, select

Main Menu > Data > Associate Dictionary > Arc >

and then select the property from the submenu:

Arcs
- Specifies the addition or removal of arcs for the currently active Bayesian network. If an arc removal is specified, it will precede any addition of an arc.
- Before adding arcs, any constraints applicable to the active Bayesian network and the Temporal Indices will be checked. If a specified arc addition is inconsistent with the existing constraint, the arc won't be added.
- Syntax Examples:
  - N1->N2=true adds an arc from N1 to N2
  - N1->N2=false removes the arc from N1 to N2
  - N1<-N2=true adds an arc from N2 to N1 (note the reversal of the arrow symbol <- produces an arc in the opposite direction).
  - Note that you need to add an escape character \ before any spaces in node names. Otherwise, a space will be interpreted as a delimiter: N\ 1->Node\ 2=true adds an arc from N 1 to Node 2
  - Instead of the -> characters, you can also use space , the equal sign = , and -- as a delimiter between the start node and end node. With these alternative delimiters, the order of the nodes determines the arc direction.
  - N1 N2=true adds an arc from N1 to N2
  - N1=N2=true adds an arc from N1 to N2
  - N1--N2=true adds an arc from N1 to N2
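
For networks with many arcs, the lines of an Arcs Dictionary can be generated with a small script. The following Python sketch applies the -> syntax and the space-escaping rule described above. The helper names are hypothetical and not part of BayesiaLab.

```python
# Hypothetical helpers for generating lines of an Arcs Dictionary; not part of BayesiaLab.

def escape_name(name):
    """Escape spaces with a backslash so they are not read as delimiters."""
    return name.replace(" ", "\\ ")


def arc_line(start, end, add=True):
    """Build one line: start->end=true adds the arc, start->end=false removes it."""
    return f"{escape_name(start)}->{escape_name(end)}={'true' if add else 'false'}"


print(arc_line("N1", "N2"))             # N1->N2=true
print(arc_line("N1", "N2", add=False))  # N1->N2=false
print(arc_line("N 1", "Node 2"))        # N\ 1->Node\ 2=true
```
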
Forbidden Arcs
- Specifies the addition or removal of Forbidden Arcs between nodes and classes.
- Syntax Examples:
  - N1->N2 adds a Forbidden Arc from N1 to N2
  - N1--N2 adds a Forbidden Arc between N1 and N2
  - ClassA->ClassB applies Forbidden Arcs from any nodes in ClassA to any nodes in ClassB
  - N1 N2 removes any existing Forbidden Arc between N1 and N2. Note the space in the syntax, which triggers the removal of the Forbidden Arc.
Arc Comments
- Adds, updates, or removes Arc Comments to arcs in the active network. Arc Comments are stored in HTML format.
- Syntax Examples:
  - N1->N2=<p>This is a sample <b>Arc Comment</b>.</p> adds an Arc Comment to the arc between N1 and N2.
  - N1->N2= removes an existing Arc Comment from the arc between N1 and N2.
  - The added Arc Comment can be edited in the Arc Editor: Arc Contextual Menu > Edit

Arc Colors
- Defines colors for arcs in the active network. You can specify the color for each arc individually by providing the hex code of the color.
- Syntax Examples:
  - N1->N2=000000 changes the color of the arc between N1 and N2 to black.
  - N1->N2=FF0000 changes the color of the arc between N1 and N2 to bright red.
  - Note that there is no option to revert an arc color to the default color. When changing Arc Colors via a Dictionary, the colors must always be specified explicitly.
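
If your colors are available as red, green, and blue values, they can be converted to the hex codes used above. This Python sketch assumes the hex code is ordered as red, green, blue, which is consistent with FF0000 producing bright red. The function name is hypothetical.

```python
# Hypothetical helper for generating lines of an Arc Colors Dictionary.

def arc_color_line(start, end, red, green, blue):
    """Build one line, with the color written as a six-digit hex code."""
    for channel in (red, green, blue):
        if not 0 <= channel <= 255:
            raise ValueError("each color channel must be between 0 and 255")
    return f"{start}->{end}={red:02X}{green:02X}{blue:02X}"


print(arc_color_line("N1", "N2", 0, 0, 0))    # N1->N2=000000
print(arc_color_line("N1", "N2", 255, 0, 0))  # N1->N2=FF0000
```
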
Structural Priors
- Assigns Structural Priors to arcs in the active network.
Fixed Arcs
- Applies Fixed Arcs to the active network or removes them.
- Syntax Examples:
  - N1->N2=true changes the arc between N1 and N2 to a Fixed Arc.
  - N1->N2=false changes the arc between N1 and N2 to a normal, non-fixed arc.

Evidence Scenario File

Context

In BayesiaLab, you can manage sets of actual or potential observations in a Bayesian network using Evidence Scenario Files.
For instance, an Evidence Scenario File can serve as a convenient way to manage multiple sets of assumptions, such as what-if scenarios. This is particularly helpful when scenarios contain many individual assumptions. Imagine the business case of an airline represented as a Bayesian network. It would have to include assumptions regarding travel demand for all origin-destination pairs. Manually setting and modifying assumptions for hundreds of nodes would not be practical.

Definition

An Evidence Scenario File consists of one or more Evidence Scenarios.
Each Evidence Scenario contains one or more node-specific observations, as illustrated below:

Applying an Evidence Scenario means setting the stored pieces of evidence to the corresponding nodes.

Store Evidence Scenarios in Validation Mode

With a given Bayesian network, any current observation on a node, or any set of observations on multiple nodes, can be recorded as an Evidence Scenario. As soon as you store an Evidence Scenario, BayesiaLab "starts a tab" by creating an internal Evidence Scenario File.

Four types of evidence can be saved as an Evidence Scenario:

Hard Evidence
Likelihood Evidence
Probabilistic Evidence
Numerical Evidence

To learn more about setting evidence, please see the section on Setting Evidence in Contextual Menu of Monitors.

Usage

Then, enter an optional comment in the pop-up window and assign a Weight to the Evidence Scenario you are storing. If you don't enter a comment, the Evidence Scenario will merely be indexed sequentially, starting with 0.

Click OK to confirm.
You can add further Evidence Scenarios to the ones already stored in the internal Evidence Scenario File.

Upon selecting (and therefore applying) an Evidence Scenario, the corresponding comment, if available, appears in the Status Bar.
Note that an Evidence Scenario File is saved with the Bayesian network file. So, reopening the saved network makes all stored Evidence Scenarios available again.

Inference with an Evidence Scenario File

In addition to recalling Evidence Scenarios one by one, you can also use them in BayesiaLab batch-processing functions:

Batch Labeling
Batch Inference
Batch Joint Probability
Batch Outlier Explanation

In this context, the Evidence Scenario File provides the observations in the same way as an internal or external dataset.

Evidence Scenario File Syntax

As with BayesiaLab's Dictionaries, the syntax of an Evidence Scenario File is straightforward. However, we need to distinguish between the syntax for Contemporaneous and Temporal networks:

Contemporaneous Networks

Each line of an Evidence Scenario File represents one Evidence Scenario.
Encoding an Evidence Scenario always follows the same pattern, with the node name and the evidence separated by a colon (:). The optional scenario name follows after a double slash (//):
?<NodeName>?:<Evidence>//<ScenarioName>
Evidence can be encoded in several ways in an Evidence Scenario File:
- Hard Evidence: ?<NodeA>?:<State1>//Scenario1
- Numerical Evidence: ?<NodeB>?:m{<value>}//Scenario2
- Probabilistic Evidence: ?<NodeC>?:p{<StateA>:0.3;<StateB>:0.5;<StateC>:0.2}//Scenario3
- Likelihood Evidence: ?<NodeD>?:l{<StateX>:1;<StateY>:0.5}//Scenario4
To encode multiple pieces of evidence in one Evidence Scenario, simply separate the individual pieces of evidence with a semicolon. The scenario name remains at the end of the line, separated by a double slash.

?<NodeA>?:<State1>;?<NodeB>?:m{<value>} ;?<NodeC>?:p{<StateA>:0.3;<StateB>:0.5;<StateC>:0.2};?<NodeD>?:l{<StateX>:1;<StateY>:0.5}//Scenario5
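
When scenarios are produced by another system, lines of this kind can be assembled programmatically. The following Python sketch is an illustration under stated assumptions, not a reference implementation: the function names are hypothetical, probabilistic evidence is assumed to sum to 1, and the ? characters that surround node names in the patterns above are omitted because this section does not explain their role. Add them if your files require them.

```python
import math

# Hypothetical helpers for building Evidence Scenario lines; not part of BayesiaLab.


def numerical(value):
    """Numerical Evidence: m{value}."""
    return f"m{{{value}}}"


def probabilistic(distribution):
    """Probabilistic Evidence: p{State:prob;...}. Assumes the values sum to 1."""
    if not math.isclose(sum(distribution.values()), 1.0):
        raise ValueError("probabilities are expected to sum to 1")
    return "p{" + ";".join(f"{s}:{p}" for s, p in distribution.items()) + "}"


def likelihood(likelihoods):
    """Likelihood Evidence: l{State:value;...}."""
    return "l{" + ";".join(f"{s}:{v}" for s, v in likelihoods.items()) + "}"


def scenario_line(evidence, name=None):
    """Join (node, evidence) pairs with semicolons and append //name if given."""
    body = ";".join(f"{node}:{value}" for node, value in evidence)
    return f"{body}//{name}" if name else body


line = scenario_line(
    [
        ("NodeA", "State1"),
        ("NodeB", numerical(3.5)),
        ("NodeC", probabilistic({"StateA": 0.3, "StateB": 0.5, "StateC": 0.2})),
        ("NodeD", likelihood({"StateX": 1, "StateY": 0.5})),
    ],
    name="Scenario5",
)
print(line)
```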

Temporal Networks

For Temporal Bayesian networks, the syntax of the Evidence Scenario File is slightly different. Here, each line in the text file starts with an integer value that represents the time step at which the evidence specified in that line will be set.
The time step and the evidence are separated by a semicolon:

<TimeStep2>;?<NodeName>?:<Evidence>
<TimeStep4>;?<NodeName>?:<Evidence>
<TimeStep19>;?<NodeName>?:<Evidence>

To encode multiple pieces of evidence in one Time Step, simply separate the individual pieces of evidence with a semicolon.

<TimeStep3>;?<NodeA>?:<State1>;?<NodeB>?:m{<value>} ;?<NodeC>?:p{<StateA>:0.3;<StateB>:0.5;<StateC>:0.2};?<NodeD>?:l{<StateX>:1;<StateY>:0.5}\Evidence for Step 3
<TimeStep5>;?<NodeA>?:<State3>;?<NodeB>?:m{<value>} ;?<NodeC>?:p{<StateA>:0.1;<StateB>:0.7;<StateC>:0.2};?<NodeD>?:l{<StateX>:0.1;<StateY>:0.5}\Evidence for Step 5
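
Because semicolons separate pieces of evidence and also appear inside p{...} and l{...}, a script that reads such lines cannot simply split on every semicolon. The following Python sketch shows one way to split only outside the braces. It is a hypothetical illustration: the function names are invented, the ? characters around node names are omitted, and trailing comments are not handled.

```python
# Hypothetical parser for one line of a temporal Evidence Scenario File.

def split_outside_braces(text, separator=";"):
    """Split text at separators that are not enclosed in {...}."""
    parts, current, depth = [], [], 0
    for char in text:
        if char == "{":
            depth += 1
        elif char == "}":
            depth -= 1
        if char == separator and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(char)
    parts.append("".join(current))
    return parts


def parse_temporal_line(line):
    """Return (time_step, [evidence, ...]) for one line."""
    fields = [field.strip() for field in split_outside_braces(line)]
    return int(fields[0]), [field for field in fields[1:] if field]


print(parse_temporal_line("3;NodeA:State1;NodeC:p{StateA:0.3;StateB:0.7}"))
```

The example prints the time step 3 together with two pieces of evidence, keeping the probabilistic evidence for NodeC in one piece.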

For Temporal networks, recalling evidence from the Evidence Scenario File differs from Contemporaneous networks.
The time-specific Evidence Scenarios are set automatically as you perform a temporal simulation.