Core Datastore Overview¶

Introduction¶

The high-level goal of the Mangrove datastore is to allow the free submission of data about a known set of entities and the quick and easy retrieval of data aggregated across time and hierarchy without requiring any upfront definition of schemas or entity structure.

The key goals can be summarized as:

Support Schema-less submission of arbitrary data

This is motivated by expected usage patterns where an organization will frequently modify the data collected based on actual usage. By avoiding requiring any a-priori definition of data-sets users are given full flexibility to adjust data collected on-the-fly.

for example, a health NGO operating rural clinics might begin by simply collecting a monthly report of how many patients where seen in that month. As they get more sophisticated they may start collecting separate values for men, women, girls and boys. This transition should not require any datastore restructuring.
Support aggregation of data across time and hierarchy (geographic as special case)

Time-based aggregations include queries such as “Average number of patients seen in 2011” or more complex segmented time aggregations such as “Average number of patients seen each month in 2011”

The key hierarchical aggregation is by geographic administrative boundaries. For example: “Total number of patients seen in 2011 for all clinics in San Francisco (or California or United States)”

Non-geographic arbitrary aggregation trees as supported as well. For example, aggregation by organization chart: “Patients seen at clinics managed by the Child Protection group”
Provide data consistency on a field level via ‘Data Dictionary’

To make it easy for users to aggregate data collect for a given entity via unstructured data submissions, the core datastore will include a ‘Data Dictionary’ where semantic-types are defined at stored. These types are then applied to submitted data fields allowing aggregation across different submissions and encouraging data consistency.

For example, our health NGO now wishes to collecting data on each patient who receives an HIV test so they submit data for each patient test in form (name, age-in-years, test-administered).

Later they start recording patients who receive family-planning counseling and collect: (name, age-in-years, counseling-program-attended)

When they want to get the average age of patients who received HIV Tests or Family Planning Counseling the system can aggregate values of ‘age-in-years’ from both submissions even though the structure of each submission is different.

And later, when they want to start registering infants seen, they can define a more useful ‘Age in Months’ field (with values ranging from 0-60) and still run aggregations of the form “Average age of patients seen” by multiplying any aggregated “Age in Years” values by 12 before averaging with “Age in Months” fields.
Provide simple Python and RESTful APIs for accessing data and standard aggregation queries

The datastore is agnostic as to both the sources and consumers of data. These APIs will allow data sources ranging from SMS engines, to XForms clients and Web applications to submit data.

On the visualization and reporting side, charting, plotting, graphing, and geographic visualization clients may access data series suitable for visualization pre-aggregated across time and hierarchy.

Core Structures¶

The logical architecture as envision has very few structures:

Entity

An ‘entity’ is anything that users may want to report on. For example: a patient, a clinic, a waterpoint, etc... Entities are typed (e.g. ‘Clinic’, ‘Waterpoint’) and uniquely identified

Entities contain no data beyond UID and TYPE

Entities must be registered in the system before data can be collected on them. Registration is nothing more than the process of assigning a UID to the entity and does not have to be a distinct user-action—the datastore can register an entity as part of the process of recording the first submission of data on the entity.
Data Record

Every time data is submitted to the datastore it is saved as an independent time-stamped data record.

Each data record is associated with a single Entity. The set of data records for a given Entity comprises all the values/data known about that Entity.

For example, if a user submits a report that 10 patients were seen in May at Clinic1, and other user submits a report that Clinic1 had stock of 20 bednets in May, the set of information known about Clinic1 is that in May 10 patients were seen and 20 bednets are in stock.
Fields and Values

Each data record contains an arbitrary set of field/value tuples with fields optionally typed from the Data Dictionary.
Data Dictionary Types

These are definitions of types which can be associated with fields in a data record. Defined types maybe contain the following:
- Type name
- Base type (numeric, string, choice, geocode etc...)
- User readable description
- Validation constraints

Questions we want to ask the Data Store¶

Rather than set out specific technical proposals, or get caught in the argument over what should be done in the DB vs. in application logic, here I try to categorize the different kinds of questions we want to be able to ask the data store.

For the examples, assume the datastore is holding information for a NGO that operates health clinics throughout the United State.

Basic Retrieval¶

Question: Retrieve all the Entities of a specific type.
Example: Show a list of all health clinics.

Question: Retrieve specific entity by a unique id.
Example: Show health clinic with ID Clinic001:

Question: Retrieve specific entity by a semi-unique id. This may return a list if there are multiple matches.
Example: Show health clinic with “Free Clinic” in its name.

State Queries¶

Question

Retrieve an Entity (or set of Entities) with a specific set of values.

Example

Show a list of all health clinics and include with each clinic:

Geographic location
Clinic Directors Name
Current stock of Cipro (an antibiotic)

Question: Return an Entity (or set of Entities) with all the latest values associated with it.
Example: Show the latest information for Clinic001. This should include the latest reported value of every field every reported on this clinic.

Question: Retrieve an Entity (or set of Entities) a set of values as of a given date
Example: Show all the latest information on Clinic001 as of Jan 15, 2010

Time Aggregated Queries¶

Question

Retrieve an Entity (or set of Entities) with a specific set of values aggregated by a function such as sum() or avg() over a given time range.

Example

Show a list of all health clinics and include with each clinic:

Total number of patients seen in 2011

Question

Retrieve an Entity (or set of Entities) with a specific set of values aggregated by a function such as sum() or avg() over a given time range with a given periodicity.

Example

Show a list of all health clinics and include with each clinic:

Average number of patients seen each month for each month in 2011

Selection Queries¶

Question: Retrieve all Entities which have a specific value.
Example: Show all health clinics where “Population Served” > 1000

Question: Retrieve all Entities which have a specific aggregated value.
Example: Show all health clinics where “Total Patients Seen” > 1000

Question: Retrieve all Entities which have a specific aggregated value over time.
Example: Show all health clinics where “Total Patients Seen in 2011” > 1000

Hierarchy Aggregated Queries¶

Note: These queries don’t return entities, they return values aggregated by a hierarchy node (e.g. ‘California’ or ‘San Francisco’) which suggests that maybe Matt Berg is right and hierarchy nodes maybe should be consider ‘Entities’, or ‘Generated Entities’...

Question

Retrieve a set of Values aggregated by a given node in a hierarchy.

Example

From the set of all clinics in California show:

Total number of patients seen in 2011 (in California)
Average number of patients seen in 2011 (in California)

Question

Retrieve a set of Values aggregated by a given level in a hierarchy.

Example

From each State in the United States show:

Total number of patients seen in clinics in that state 2011
Average number of patients seen in clinics in that state in 2011

Core Datastore Overview¶

Introduction¶

Core Structures¶

Questions we want to ask the Data Store¶

Basic Retrieval¶

State Queries¶

Time Aggregated Queries¶

Selection Queries¶

Hierarchy Aggregated Queries¶

Table Of Contents

Previous topic

Next topic

This Page

Navigation

Core Datastore Overview¶

Introduction¶

Core Structures¶

Questions we want to ask the Data Store¶

Basic Retrieval¶

State Queries¶

Time Aggregated Queries¶

Selection Queries¶

Hierarchy Aggregated Queries¶

Table Of Contents

Previous topic

Next topic

This Page

Quick search

Navigation