A lot of data is tabular in nature, and is efficiently encoded in text files. While such files are easy to produce and read, they bring with them several challenges when used in visualization tools and other programs that have to understand some of the data’s properties. Examples include categorical data, special values in numerical columns (which are common in Census data), and information about the data like its producer. Here is a proposal for a simple data description format that provides that missing information. I call it qnch.
The goal of this language is to provide all the necessary information that is commonly needed when parsing a data file and trying to find more information about it. The goal is not to cover every use case that could possibly exist, nor to create elaborate taxonomies of data types. There are other efforts for that, and I have no desire to compete with them. qnch is meant to be simple, readable, easy to implement and use, yet versatile and useful.
Information about data files is usually given in accompanying text or PDF files, in a way that cannot be parsed automatically. This makes the processing of the data for visualization and other purposes much more of a challenge than necessary.
While it may make sense to figure out a way to include all the additional information in the actual data file, it’s not realistic to expect all data producers to switch to such a format, or all data processing and visualization programs to be able to read it. A separate file does not require the original data to be altered at all, and can be produced either by the data source directly or by a third party. I can even imagine setting up a registry that you can give the URL of a data file and that points to where you can find a third-party qnch file.
qnch is written using a structured format called YAML. This was chosen on purpose to make the file human-readable (and also -writable) without all the clutter and complexity of XML. An equivalent XML implementation would consist mostly of markup without any benefits. YAML is a very clean format that is very similar in design to JSON (and is, in fact, a true superset of JSON), and thus includes useful semantic features like lists and key-value pairs.
Below is a discussion of the different features and some examples that show the usage in actual qnch syntax. This is not a complete specification, but it contains enough information to get the discussion going about additional features and other ideas.
Basic information for the Parser, Format
The parser needs to know a few things about the file to parse it. In addition to the encoding, the type is specified here. For delimited files, most of the information for the parser is specified at the beginning of the file. Character numbers for fixed-column formats are given as part of the dimension definitions.
encoding (optional). What encoding is used for the data file? This is important for textual data and perhaps categories, since numbers are rather immune to encoding changes. The default is utf-8, which conveniently includes ASCII as a subset. Other ISO encodings can be specified here, as well. Others, such as non-Unicode Asian encodings, or exotic encodings like EBCDIC are explicitly not supported. Unicode is simply the way to go.
type (mandatory). This can be
fixed. Delimited means that there is a delimiter character that separates the fields on each line. Fixed means that the values are located at specific character indices on each line. There are two shortcuts here, csv and tsv. These specify a delimiter of a comma and tab, respectively.
delimiter (optional). Given for delimited files, this is the character that separates values on each line. If not given for type fixed, defaults to a comma. Any character can be used here.
quotation-character (optional). For delimited files, this is used to enclose fields whose values contain the delimiter character. This defaults to double quotes.
headers (optional). For a delimited file, is there a header line? Value can be
false, default is
false, the sequence in which dimensions are specified is considered the sequence of dimensions on each line.
strict (optional). Whether the parser should throw errors when an unspecified value is encountered. The default is
false. If set to
true, this will throw errors when there are columns that were not defined, categories that were not specified, or when a numerical value is outside the range specified by the minimum and maximum fields below.
This is the place for human-readable information about the dataset. The list below is what seems to be useful for most cases, but there is clearly room for more. The source field therefore can contain additional pieces of information that are not specified here, but that will be shown to the user if present in a qnch file.
- name (optional). The name of the dataset. If not provided, the name of the data file is used.
- description (optional). A longer description of the dataset.
- date (optional). The date of the data set, if applicable. This field can contain any level of precision, from only a year down to a time stamp that is precise to the second.
- source (optional). Information about the data source. This field has a number of named sub-fields, specified below.
- organization (optional). The name of the publisher of the data set.
- organization-url (optional). A URL identifying the publisher.
- contact (optional). The name and email address of a person you can talk to about the data.
- info-url (optional). A URL pointing to the page where the dataset can be found.
- data-url (optional). A URL pointing to the canonical location of the data file itself.
- citation (optional). How to cite this dataset in a publication.
- Sub-fields that are not named above are acceptable, but the program may not be able to do more with them than display them in a list.
- row-pattern (optional). A regular expression that is matched against each line before it is broken up into fields to decide whether to consider it or not. This is useful for U.S. Census data where housing and person records are mixed in the same file. In this case, you need to specify a separate qnch file for each type of record in the dataset.
- dimensions (optional). The list of columns, or data dimensions, in the file using the format described below. This list can be left out if all that is desired is the metadata for the entire file.
Using the U.S. Census Public-Use Microdata Sample (PUMS) as an example, here is what the complete header of a qnch file could look like that applies to the household records (indicated by the letter “H” at the start of the row):
name: U.S. Census 2000, Households type: fixed source: organization: Census Bureau organization-url: http://census.gov/ info-url: http://www.census.gov/main/www/pums.html downloaded: 2009-08-10 row-pattern: ^H strict: true
Each column has data associated with it that depends on its type and on the type of file. The first list are items that both numerical and categorical columns share. Columns can be specified in any order for delimited files with header and for fixed column files; they need to be specified in the order they appear in for delimited files without header.
- name (mandatory). This is the name of the dimension. If no id is given, this also serves as the id that is used in delimited files with header lines to find the right column. This is usually a more human-readable version of the id.
- id (optional). Used to find the column in delimited files with header.
- description (optional). An additional, longer description of the dimension.
- variable-name (optional). This is mostly inspired by the U.S. Census data: provide an additional name that is used as a variable name to hold the value, like in a statistical package.
- type (optional). The type of data in this column:
text. The default is
numerical. Numerical and categorical are specified further below, but text is simply treated as strings that can only be shown as textual information. That means, no further processing is performed, and no aggregates are created. This is useful for place names, product descriptions, and other text that is never used for actual analysis.
- characters (mandatory for fixed column format). The range of characters that contains the data for this data dimension. Both ends are inclusive and must be specified (even if the field is only one character long), counting starts at 1. For a value that starts in the fifth column and whose field is four characters long, this would be: 5-8
- categories (optional). For a categorical dimensions, this field contains a list of category definitions as described below.
Numerical columns have the following additional fields that can be specified.
- minimum, maximum (optional). These can be used to scale charts or to show to the user before the actual data is loaded in. The parser must not rely on these values, though, and will accept values outside this range unless strict is set to true.
- special-values (optional). Define strings that are treated as special values in this numerical dimension. The values are specified like categories below. Values are matched as strings before the field is parsed as a number, so special values do not have to be (but can be) valid numbers. This is important to be able to differentiate between 0, 00, 000, 0-0, and other values that can be found in Census data.
- precision (optional). The smallest difference that the acquisition method used can resolve. If not given, this is assumed to be the smallest difference a standard float can encode. If specified, it can be used to show the granularity of the data in a visualization.
- unit (optional). The unit this number is measured in, the default is no unit. Programs can, but do not have to, offer to convert numbers between units. Given the vast number of units out there, we would need to severely limit the allowed units to make this a requirement, though.
Again using the U.S. Census, here is a column definition for a simple numerical column:
- name: Housing unit weight variable-name: HWEIGHT characters: 102-105 minimum: 0 maximum: 1975
Almost all numerical columns have special values. Here is the number of bedrooms:
- name: Number of Bedrooms variable-name: BEDRMS type: numerical characters: 124-124 special-values: <blank>: Not in universe 0: No bedrooms 5: 5 or more bedrooms
Categories are specified as a list of key-value pairs, with the key being the name found in the file, and the value being the human-readable description. This is the same for special values in numerical dimensions. There is one special value,
<blank>, that describes an empty field. Such fields can exist in both delimited and fixed formats, and usually mean “missing” or “not in universe”.
Here is a definition for the categorical heating fuel field in the Census data:
- name: Heating Fuel variable-name: FUEL type: categorical characters: 132-132 categories: <blank>: Not in universe/unknown 1: Gas from pipes 2: Gas from tank, bottles, LP 3: Electricity 4: Fuel oil, kerosene, etc. 5: Coal or coke 6: Wood 7: Solar energy 8: Other fuel 9: No fuel used
Matching Data and Meta-Data
Matching a CSV or other data file with its qnch file is problematic, because they may have come from different sources, file names can change, and the data file does not contain any meta data to match against (like its canonical URL). The obvious convention is to name the qnch file the same as the data file, but with the extension .qnch. When opening a data file, the parser will look for that file by replacing the extension (or adding it, when the data file has no extension).
A similar convention is to look for underscore characters in the data filename and remove the last part the file name consisting of that underscore and the rest before replacing the extension. This would shorten the file named HouseholdsNC.txt to Households.txt, which would then match Households.qnch. This is only done when no HouseholdsNC.qnch file is found.
If none of these conventions turn up a qnch file, the program should ask the user to pick one, or otherwise offer to import the data based on best guesses.
I will incorporate a qnch parser into the next version of Parallel Sets. Eventually, I will separate the parser and the Data Wizard out of the ParSets program and make it a separate project. qnch files for all datasets available for download through Parallel Sets will also be made available. I hope that based on my implementation and the description here, others will contribute qnch files, parsers, and producers.
So now it’s your turn! Let me know what you think, what’s missing, etc.!