----------------------- Core Datastore Overview ----------------------- Introduction ------------ The high-level goal of the Mangrove datastore is to allow the free submission of data about a known set of *entities* and the quick and easy retrieval of data aggregated across time and hierarchy *without* requiring any upfront definition of schemas or entity structure. The key goals can be summarized as: * **Support Schema-less submission of arbitrary data** This is motivated by expected usage patterns where an organization will frequently modify the data collected based on actual usage. By avoiding requiring any a-priori definition of data-sets users are given full flexibility to adjust data collected on-the-fly. for example, a health NGO operating rural clinics might begin by simply collecting a monthly report of how many patients where seen in that month. As they get more sophisticated they may start collecting separate values for men, women, girls and boys. This transition should not require any datastore restructuring. * **Support aggregation of data across time and hierarchy (geographic as special case)** Time-based aggregations include queries such as "Average number of patients seen in 2011" or more complex segmented time aggregations such as "Average number of patients seen each month in 2011" The key hierarchical aggregation is by geographic administrative boundaries. For example: "Total number of patients seen in 2011 for all clinics in San Francisco (or California or United States)" Non-geographic arbitrary aggregation trees as supported as well. For example, aggregation by organization chart: "Patients seen at clinics managed by the Child Protection group" * **Provide data consistency on a field level via 'Data Dictionary'** To make it easy for users to aggregate data collect for a given entity via unstructured data submissions, the core datastore will include a 'Data Dictionary' where semantic-types are defined at stored. These types are then applied to submitted data fields allowing aggregation across different submissions and encouraging data consistency. For example, our health NGO now wishes to collecting data on each patient who receives an HIV test so they submit data for each patient test in form (name, age-in-years, test-administered). Later they start recording patients who receive family-planning counseling and collect: (name, age-in-years, counseling-program-attended) When they want to get the average age of patients who received HIV Tests or Family Planning Counseling the system can aggregate values of 'age-in-years' from both submissions even though the structure of each submission is different. And later, when they want to start registering infants seen, they can define a more useful 'Age in Months' field (with values ranging from 0-60) and still run aggregations of the form "Average age of patients seen" by multiplying any aggregated "Age in Years" values by 12 before averaging with "Age in Months" fields. * **Provide simple Python and RESTful APIs for accessing data and standard aggregation queries** The datastore is agnostic as to both the sources and consumers of data. These APIs will allow data sources ranging from SMS engines, to XForms clients and Web applications to submit data. On the visualization and reporting side, charting, plotting, graphing, and geographic visualization clients may access data series suitable for visualization pre-aggregated across time and hierarchy. Core Structures --------------- The logical architecture as envision has very few structures: * **Entity** An 'entity' is anything that users may want to report on. For example: a patient, a clinic, a waterpoint, etc... Entities are *typed* (e.g. 'Clinic', 'Waterpoint') and *uniquely identified* Entities contain *no data* beyond **UID** and **TYPE** Entities must be *registered* in the system before data can be collected on them. Registration is nothing more than the process of assigning a UID to the entity and does *not* have to be a distinct user-action—the datastore can register an entity as part of the process of recording the first submission of data on the entity. * **Data Record** Every time data is submitted to the datastore it is saved as an independent time-stamped data record. Each data record is associated with a single Entity. The set of data records for a given Entity comprises all the values/data known about that Entity. For example, if a user submits a report that 10 patients were seen in May at Clinic1, and other user submits a report that Clinic1 had stock of 20 bednets in May, the set of information known about Clinic1 is that in May 10 patients were seen and 20 bednets are in stock. * **Fields and Values** Each data record contains an arbitrary set of field/value tuples with fields optionally typed from the Data Dictionary. * **Data Dictionary Types** These are definitions of types which can be associated with fields in a data record. Defined types maybe contain the following: * Type name * Base type (numeric, string, choice, geocode etc...) * User readable description * Validation constraints Questions we want to ask the Data Store --------------------------------------- Rather than set out specific technical proposals, or get caught in the argument over what should be done in the DB vs. in application logic, here I try to categorize the different kinds of questions we want to be able to ask the data store. For the examples, assume the datastore is holding information for a NGO that operates health clinics throughout the United State. Basic Retrieval +++++++++++++++ **Question** Retrieve all the Entities of a specific type. **Example** Show a list of all health clinics. -------- **Question** Retrieve specific entity by a unique id. **Example** Show health clinic with ID Clinic001: -------- **Question** Retrieve specific entity by a semi-unique id. This may return a list if there are multiple matches. **Example** Show health clinic with "Free Clinic" in its name. State Queries +++++++++++++ **Question** Retrieve an Entity (or set of Entities) with a specific set of values. **Example** Show a list of all health clinics and include with each clinic: * Geographic location * Clinic Directors Name * Current stock of Cipro (an antibiotic) -------- **Question** Return an Entity (or set of Entities) with *all* the latest values associated with it. **Example** Show the latest information for Clinic001. This should include the latest reported value of every field every reported on this clinic. -------- **Question** Retrieve an Entity (or set of Entities) a set of values *as of a given date* **Example** Show all the latest information on Clinic001 as of Jan 15, 2010 Time Aggregated Queries +++++++++++++++++++++++ **Question** Retrieve an Entity (or set of Entities) with a specific set of values *aggregated* by a function such as ``sum()`` or ``avg()`` over a given time range. **Example** Show a list of all health clinics and include with each clinic: * Total number of patients seen in 2011 -------- **Question** Retrieve an Entity (or set of Entities) with a specific set of values *aggregated* by a function such as ``sum()`` or ``avg()`` over a given time range with a given *periodicity*. **Example** Show a list of all health clinics and include with each clinic: * Average number of patients seen each *month* for each month in 2011 Selection Queries +++++++++++++++++ **Question** Retrieve all Entities which have a specific value. **Example** Show all health clinics where "Population Served" > 1000 -------- **Question** Retrieve all Entities which have a specific *aggregated* value. **Example** Show all health clinics where "Total Patients Seen" > 1000 -------- **Question** Retrieve all Entities which have a specific aggregated value over time. **Example** Show all health clinics where "Total Patients Seen in 2011" > 1000 Hierarchy Aggregated Queries ++++++++++++++++++++++++++++ **Note**: These queries don't return entities, they return values aggregated by a hierarchy node (e.g. 'California' or 'San Francisco') which suggests that maybe Matt Berg is right and hierarchy nodes maybe should be consider 'Entities', or 'Generated Entities'... **Question** Retrieve a set of *Values* aggregated by a given *node* in a hierarchy. **Example** From the set of all clinics in California show: * Total number of patients seen in 2011 (in California) * Average number of patients seen in 2011 (in California) -------- **Question** Retrieve a set of *Values* aggregated by a given *level* in a hierarchy. **Example** From each State in the United States show: * Total number of patients seen in clinics in that state 2011 * Average number of patients seen in clinics in that state in 2011