Rons Data Stream Concepts
- Data Cleaners
- Process Job
- Quick Job
Rons Data Stream is built around two core concepts: Cleaners and Jobs.
Cleaners contain a list of instructions that tell the application what actions to perform on a table of data (for example, a CSV file), and can be shared between Jobs.
Jobs contain information about the actual data tables themselves, such as where they are and how to read them. Job files can be located relative to the data files they process, so that simply double-clicking the Job and hitting Process makes processing hundreds of files intuitive.
If data needs to be processed ad hoc, there is a Quick Job option to process a file or directory quickly.
Jobs and Cleaners can be saved, renamed, and used again later, which saves a great deal of time on future data processing jobs.
Cleaners are the core of Rons Data Stream (and its namesake). They contain a list of actions, or rules, that describe what the Cleaner is going to do to any particular data source.
In order to start cleaning or amending files, a Cleaner must be created, and cleaning rules added to it. The categories of rules are as follows:
- Column Selectors
Each Column Selector rule defines criteria for selecting one or more columns. Column Selectors are used throughout a Cleaner to determine which columns a rule operates on.
- Row Selectors
Each Row Selector rule defines criteria for selecting one or more rows. Row Selectors are also used throughout a Cleaner to determine which rows a rule operates on.
- Column Processors
Column operations define what actions are performed on the columns, and most require Column Selectors. For example, deleting or merging columns.
- Row Processors
Row operations define what actions are performed on the rows; all require Row Selectors, and some also require Column Selectors. For example, deleting or duplicating rows.
- Cell Processors
Cell operations define what actions are performed on each cell; all require both Row Selectors and Column Selectors. For example, adding a row number to a cell.
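Rons Data Stream is a GUI application, so the rule categories above are configured visually rather than in code. Purely as an illustrative sketch (all class and field names here are assumptions, not the product's API), selectors and processors could be modeled like this:

```python
from dataclasses import dataclass

@dataclass
class ColumnSelector:
    # Selects columns by exact name; the real rules offer richer criteria.
    names: list

    def matches(self, column):
        return column in self.names

@dataclass
class RowSelector:
    # Selects rows whose cell in `column` equals `value`.
    column: str
    value: str

    def matches(self, row):
        return row.get(self.column) == self.value

@dataclass
class DeleteColumns:
    # A Column Processor: removes the columns its Column Selector matches.
    selector: ColumnSelector

    def apply(self, rows):
        return [{k: v for k, v in row.items()
                 if not self.selector.matches(k)} for row in rows]

@dataclass
class DeleteRows:
    # A Row Processor: removes the rows its Row Selector matches.
    selector: RowSelector

    def apply(self, rows):
        return [row for row in rows if not self.selector.matches(row)]

rows = [{"id": "1", "name": "Ada", "temp": "x"},
        {"id": "2", "name": "", "temp": "y"}]
cleaner = [DeleteColumns(ColumnSelector(["temp"])),  # drop the temp column
           DeleteRows(RowSelector("name", ""))]      # drop rows with no name
for rule in cleaner:
    rows = rule.apply(rows)
# rows == [{"id": "1", "name": "Ada"}]
```

The key idea the sketch captures is that selectors only describe *which* data a rule touches, while processors describe *what* is done to it.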
Cleaner Rules Processing
Cleaner rules are processed in the following order:
- Cell rules that apply to column names
Any cell rules that have a Row Selector of type 'Header Row' are applied first. Subsequent rules that use column names will use the newly cleaned column names.
- Column Operations
Columns are processed once, before the body data, to establish the shape of the output data.
- Row Operations
Rows are added or removed from the processing pipeline and passed to the cell operations.
- Cell Operations
All cell rules that do not have a Row Selector of type 'Header Row' are applied.
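The ordering above can be sketched as a simple pipeline. This is an illustrative model only (the function and rule representations are assumptions, not the application's internals): header-row cell rules run first, then column, row, and remaining cell operations.

```python
# Illustrative sketch of the Cleaner processing order described above.
def run_cleaner(header, body, header_cell_rules, column_rules,
                row_rules, cell_rules):
    # 1. Cell rules with a 'Header Row' selector clean the column names first,
    #    so later rules see the cleaned names.
    for rule in header_cell_rules:
        header = [rule(name) for name in header]
    # 2. Column operations run once to establish the output shape.
    for rule in column_rules:
        header, body = rule(header, body)
    # 3. Row operations add or remove rows from the pipeline.
    for rule in row_rules:
        body = rule(body)
    # 4. Remaining cell rules are applied to every body cell.
    for rule in cell_rules:
        body = [[rule(cell) for cell in row] for row in body]
    return header, body

# Example: trim header names, drop empty rows, upper-case every cell.
header, body = run_cleaner(
    [" id ", " name "],
    [["1", "ada"], ["", ""], ["2", "bob"]],
    header_cell_rules=[str.strip],
    column_rules=[],
    row_rules=[lambda rows: [r for r in rows if any(r)]],
    cell_rules=[str.upper],
)
# header == ["id", "name"]; body == [["1", "ADA"], ["2", "BOB"]]
```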
The aim of Rons Data Stream is to shave hours off our customers' days by allowing one-click data processing, whilst drinking coffee. The Cleaners describe what happens to the data, so a way of describing which data to apply the Cleaner to was necessary. The Job does that.
Jobs contain a list of data sources, a Cleaner, and a list of outputs, which is all the information needed to process data with a button press. Multiple sources or outputs can be selected, allowing multiple files to be processed at the same time.
There are five sections that need to be set up with the necessary information to run a Job:
- Source Containers
Source Containers do exactly as the name implies: contain data sources. Currently there is only one type of container, a directory containing files, but in the future source containers will include Azure storage and various types of database.
- Source Profiles
Source Profiles contain information about how files in the Source Containers are to be processed. Typically they contain a file pattern (like '*.csv') to select files in the container, and information about how to read them. For example, CSV files need the delimiter type specified in order to be read.
- Output Containers
Similar to Source Containers, Output Containers do exactly as the name implies: contain the result of file processing.
- Output Loggers
When processing large data sources it can be difficult to spot errors, so loggers can be configured and associated with Output Formatters. Output Loggers require an Output Container, which can be separate from the data Output Container(s).
- Output Formatters
Output Formatters determine the format of the data that is written after it has been processed. Output Formatters require an Output Container.
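Taken together, the five sections amount to a small configuration record. As a sketch only (field names and values are assumptions; the real application configures these in its user interface), a Job could be modeled as:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Job:
    # Illustrative model of the five Job sections; not the product's file format.
    source_container: str        # directory holding the input files
    source_profile: dict         # file pattern plus read settings
    cleaner: str                 # the Cleaner to apply to each source
    output_container: str        # directory receiving the processed files
    output_formatter: str        # format of the written data, e.g. "csv"
    output_logger: Optional[str] = None  # optional; may use its own container

job = Job(
    source_container="./incoming",
    source_profile={"pattern": "*.csv", "delimiter": ","},
    cleaner="Tidy Sales Data",
    output_container="./processed",
    output_formatter="csv",
)
```

Note how the logger is the only optional section: a Job can run without one, but the other four pieces are all needed before the Process button does anything useful.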
Processing Jobs and Preview
After a Job has been configured, it can be run at the click of a button, or previewed to see the result of the rules before execution. Processing runs in parallel.
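Parallel processing here means each source file can be cleaned independently. A minimal sketch of that idea, assuming a hypothetical clean_file step (the application's own scheduling is not exposed):

```python
from concurrent.futures import ThreadPoolExecutor

def clean_file(name):
    # Stand-in for applying the Cleaner to one source file.
    return name.upper()

files = ["jan.csv", "feb.csv", "mar.csv"]
# Each file is an independent unit of work, so they can run concurrently;
# map() still returns the results in the original file order.
with ThreadPoolExecutor() as pool:
    results = list(pool.map(clean_file, files))
# results == ["JAN.CSV", "FEB.CSV", "MAR.CSV"]
```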
For less complicated processing, where the creation of a Job seems a little excessive, a Quick Job can be used to specify just the elements needed to clean some files:
- Source directory or file
- Source format
- Data Cleaner
- Destination directory
- Logger (optional)
- Output format
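In effect a Quick Job collapses the full Job configuration into one ad hoc call. As a sketch only (the function name, parameters, and example values are assumptions, not the product's API), the elements above map to:

```python
def quick_job(source, source_format, cleaner, destination,
              output_format, logger=None):
    # Illustrative model: gather the Quick Job elements into one record,
    # mirroring the dialog the application presents.
    return {
        "source": source,
        "source_format": source_format,
        "cleaner": cleaner,
        "destination": destination,
        "output_format": output_format,
        "logger": logger,  # optional, as in the list above
    }

settings = quick_job("./data/sales.csv", "csv", "Tidy Sales Data",
                     "./out", "xlsx")
```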