

  1. # Dataset Schema (version 4.0.0)
  2. Dataset schema provides a specification of an abstract dataset. It is contained in the [datasetSchema.json](../schemas/datasetSchema.json) file. An instance of this dataset schema is included with every dataset in the datasetDoc.json file. The semantics of datasetSchema.json and datasetDoc.json are explained [here](FAQ.md#semantics).
  3. Dataset schema specifies a dataset in three sections: about, dataResources, and qualities. Each of these sections is described below.
  4. ## About
  5. The "about" section contains some general information about the dataset and consists of the following fields.
  6. | Field | Description |
  7. |-----------------------|---------------------------------------------------------------------------------------------------|
  8. | datasetID | a unique ID assigned to a dataset |
  9. | datasetName | the name of a dataset |
  10. | datasetURI | the location of the dataset |
  11. | description | a brief description of the dataset |
  12. | citation | citation of the source of the dataset |
  13. | humanSubjectsResearch | indicates if the dataset contains human subjects data or not |
  14. | license | license of the source of the dataset |
  15. | source | source of the dataset |
  16. | sourceURI | location of the source of the dataset |
  17. | approximateSize | size of the dataset |
  18. | applicationDomain | application domain of the dataset (e.g., medical, environment, transportation, agriculture, etc.) |
  19. | datasetVersion | version of the current dataset |
  20. | datasetSchemaVersion | version of the datasetSchema |
  21. | redacted | a field indicating if the dataset has been redacted or not |
  22. | publicationDate | publication date of the dataset, if available |
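
As a rough illustration of how the "about" section might be inspected, the sketch below checks which of the fields from the table above appear in a document. The field list is copied from the table; the sample document and its values are hypothetical, and the sketch makes no claim about which fields the schema treats as required.

```python
import json

# Field names copied from the "about" table above.
ABOUT_FIELDS = [
    "datasetID", "datasetName", "datasetURI", "description", "citation",
    "humanSubjectsResearch", "license", "source", "sourceURI",
    "approximateSize", "applicationDomain", "datasetVersion",
    "datasetSchemaVersion", "redacted", "publicationDate",
]


def summarize_about(dataset_doc):
    """Split the documented "about" fields into present and absent ones."""
    about = dataset_doc.get("about", {})
    present = [name for name in ABOUT_FIELDS if name in about]
    absent = [name for name in ABOUT_FIELDS if name not in about]
    return present, absent


# Hypothetical datasetDoc.json content; the values are made up for illustration.
example_doc = json.loads("""
{
  "about": {
    "datasetID": "iris_dataset",
    "datasetName": "Iris",
    "datasetSchemaVersion": "4.0.0",
    "redacted": false
  }
}
""")

present, absent = summarize_about(example_doc)
print("present:", present)
print("absent:", absent)
```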
  23. ## Data Resources
  24. The "dataResources" section annotates all the data resources in a dataset. A dataset is considered a set of data resources. Therefore, the "dataResources" field is a list of dictionaries, one item per resource. Each dictionary item contains the following information about a data resource.
  25. | Field | Description |
  26. |-----------------------|-------------------------------------------------------------------------------------------------------------------|
  27. | resID | a unique ID assigned to a resource |
  28. | resPath | path of this resource relative to the dataset root |
  29. | resType | the type of this resource<sup>+</sup> |
  30. | resFormat | a dictionary structure containing the occurring media types in this dataset as the keys and their corresponding list of file formats as the values <sup>++</sup>|
  31. | isCollection | a boolean flag indicating if this resource is a single item/file or a collection of items/files <sup>+++</sup>. |
  32. <sup>+</sup> __resType__ can be one of the following: __["image","video","audio","speech","text","graph","edgeList","table","timeseries","raw"]__
  33. If `resType` is `edgeList`, the resource is basically a graph, but it is represented using an edge list. Each item in this list refers to an edge given by its source node and its target node (in that order, if it is a directed graph). For example, refer to [Case 5](#case-5) below. The edge list can also have optional edge attributes: [edgeID, source_node, target_node, edge_attr1, edge_attr2, ....]
  34. <sup>++</sup> Here is an example of `resFormat`
  35. ```
  36. "resFormat":{
  37. "image/jpeg":["jpeg", "jpg"],
  38. "image/png":["png"]
  39. }
  40. ```
  41. The [supportedResourceTypesFormats.json](supportedResourceTypesFormats.json) file contains the current list of all the resTypes, resFormats and the file types used in the datasets in a machine-readable way.
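
A minimal sketch of how a `resFormat` dictionary like the one above could be used: given a file name, look up the media type whose extension list contains the file's extension. The helper name `media_type_for` is hypothetical and not part of the schema or any tooling.

```python
import os


def media_type_for(filename, res_format):
    """Return the media type whose extension list contains the file's extension."""
    extension = os.path.splitext(filename)[1].lstrip(".").lower()
    for media_type, extensions in res_format.items():
        if extension in extensions:
            return media_type
    return None  # the extension is not listed in resFormat


# The resFormat value from the example above.
res_format = {"image/jpeg": ["jpeg", "jpg"], "image/png": ["png"]}

print(media_type_for("img1.png", res_format))
print(media_type_for("notes.txt", res_format))
```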
  42. <sup>+++</sup>A **collection** of resources/files can be organized into directories and sub-directories.
  43. If `resType` is "table" or "timeseries" (which is essentially also a table), columns in the table can be annotated with the following information.
  44. Not all columns are necessarily annotated (see [minimal metadata datasets](./minimalMetadata.md) for an example).
  45. There is an optional `columnsCount` field of the resource which can convey how many columns there are,
  46. which can be used to determine if all columns are annotated.
  47. | Field | Description |
  48. |-----------------------|-------------------------------------------------------------------------------|
  49. | colIndex | index of the column |
  50. | colName | name of the column |
  51. | colDescription | description of a column |
  52. | colType | the data type of this column |
  53. | role | the role this column plays |
  54. | refersTo | an optional reference to another resource if applicable |
  55. | timeGranularity | if a particular column represents time axis, this field captures the time sampling granularity as a numeric `value` and `unit` (e.g., 15-second granularity). |
  56. A column's type can be __one of__ the following:
  57. | Type | Description |
  58. |----------------------|-------------------------------------------------------------------------------|
  59. | boolean ||
  60. | integer ||
  61. | real ||
  62. | string ||
  63. | categorical ||
  64. | dateTime ||
  65. | realVector | This is stored as a string in CSV files. It's represented as "val1,val2,val3,val4,...". For example, in the objectDetection task, the bounding box is a rectangle, which is represented as: "0,160,156,182,276,302,18,431". |
  66. | json ||
  67. | geojson | See here: http://geojson.org/ |
  68. | unknown | Column type is unknown. |
  69. A column's role can be __one or more__ of the following:
  70. | Role | Description |
  71. |-------------------------|-----------------|
  72. | index | Primary index of the table. Should have unique values without missing values |
  73. | multiIndex | Primary index of the table. Values in this primary index are not necessarily unique, which allows the same row to be repeated multiple times (useful for certain task types like multiLabel and objectDetection) |
  74. | key | Any column that satisfies the uniqueness constraint can play the role of a key. A table can have many keys. Values can be missing in such a column. A key can be referenced in other tables (foreign key)|
  75. | attribute | The column contains an input feature or attribute that should be used for analysis |
  76. | suggestedTarget | The column is a potential target variable for a problem. If a problem chooses not to employ it as a target, by default, it will also have an attribute role |
  77. | timeIndicator | Entries in this column are time entries |
  78. | locationIndicator | Entries in this column correspond to physical locations |
  79. | boundaryIndicator | Entries in this column indicate boundaries (e.g., start/stop in audio files, bounding boxes in image files, etc.) |
  80. | interval | Values in this column represent an interval and will contain a "realVector" of two values|
  81. | instanceWeight | This is a weight assigned to the row during training and testing. It is used when a single row or pattern should be weighted more or less than others.|
  82. | boundingPolygon | This is a type of boundary indicator representing geometric shapes. Its type is "realVector". It contains two coordinates for each polygon vertex, so the string always has an even number of elements. The vertices are ordered counter-clockwise. For example, a bounding box in objectDetection consisting of 4 vertices and 8 coordinates: "0,160,156,182,276,302,18,431" |
  83. | suggestedPrivilegedData | The column is a variable that is potentially unavailable during testing for a problem. If a problem chooses not to employ it as privileged data, by default, it will also have an attribute role |
  84. | suggestedGroupingKey | A column with this role is a potential candidate for a group_by operation (rows having the same values in this column should be grouped together). Multiple columns with this role can be grouped to obtain hierarchical multi-index data. |
  85. | edgeSource | Entries in this column are source nodes of edges in a graph |
  86. | directedEdgeSource | Entries in this column are source nodes of directed edges in a graph |
  87. | undirectedEdgeSource | Entries in this column are source nodes of undirected edges in a graph |
  88. | multiEdgeSource | Entries in this column are source nodes of edges in a multi graph |
  89. | simpleEdgeSource | Entries in this column are source nodes of edges in a simple graph |
  90. | edgeTarget | Entries in this column are target nodes of edges in a graph |
  91. | directedEdgeTarget | Entries in this column are target nodes of directed edges in a graph |
  92. | undirectedEdgeTarget | Entries in this column are target nodes of undirected edges in a graph |
  93. | multiEdgeTarget | Entries in this column are target nodes of edges in a multi graph |
  94. | simpleEdgeTarget | Entries in this column are target nodes of edges in a simple graph |
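
To make the `realVector` encoding and the `boundingPolygon` role more concrete, here is a small sketch that parses such a cell into floats and then pairs them up into vertices. It assumes the coordinates are stored as consecutive x,y pairs; the function names are hypothetical.

```python
def parse_real_vector(cell):
    """Parse a realVector cell such as "1.5,2,3" into a list of floats."""
    return [float(value) for value in cell.split(",")]


def polygon_vertices(cell):
    """Group a boundingPolygon cell into (x, y) vertices.

    Assumes consecutive values form x,y pairs, which is why the number
    of elements must be even.
    """
    values = parse_real_vector(cell)
    if len(values) % 2 != 0:
        raise ValueError("boundingPolygon needs an even number of values")
    return list(zip(values[0::2], values[1::2]))


# The bounding box string used in the tables above: 8 values, 4 vertices.
print(polygon_vertices("0,160,156,182,276,302,18,431"))
```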
  95. A time granularity unit can be __one of__ the following:
  96. | Unit | Description |
  97. |----------------------|-------------------------------------------------------------------------------|
  98. | seconds ||
  99. | minutes ||
  100. | days ||
  101. | weeks ||
  102. | months ||
  103. | years ||
  104. | unspecified ||
  105. __Notes regarding index, multi-index roles__
  106. A table can have only one `index` or `multiIndex` column.
  107. For a table with `multiIndex` column, for a particular index value there can be multiple rows. Those rows have exactly the same values in all columns except columns with `suggestedTarget` role.
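
The two rules above can be checked mechanically. The following sketch, with hypothetical helper names and made-up sample rows, first finds the single `index` or `multiIndex` column and then reports index values whose repeated rows differ in any column other than those with the `suggestedTarget` role.

```python
def find_index_column(columns):
    """Return the single column annotated as "index" or "multiIndex"."""
    candidates = [
        col for col in columns
        if "index" in col["role"] or "multiIndex" in col["role"]
    ]
    if len(candidates) != 1:
        raise ValueError("expected exactly one index or multiIndex column")
    return candidates[0]


def multi_index_violations(rows, columns):
    """List index values whose repeated rows differ outside suggestedTarget columns."""
    index_name = find_index_column(columns)["colName"]
    fixed = [
        col["colName"] for col in columns
        if col["colName"] != index_name and "suggestedTarget" not in col["role"]
    ]
    first_seen = {}
    violations = []
    for row in rows:
        key = row[index_name]
        signature = tuple(row[name] for name in fixed)
        if key not in first_seen:
            first_seen[key] = signature
        elif first_seen[key] != signature and key not in violations:
            violations.append(key)
    return violations


# Hypothetical column annotations and rows for a multi-label image table.
columns = [
    {"colName": "d3mIndex", "role": ["multiIndex"]},
    {"colName": "image", "role": ["attribute"]},
    {"colName": "label", "role": ["suggestedTarget"]},
]
rows = [
    {"d3mIndex": "0", "image": "img1.png", "label": "cat"},
    {"d3mIndex": "0", "image": "img1.png", "label": "dog"},  # allowed: only the target differs
    {"d3mIndex": "1", "image": "img2.png", "label": "dog"},
    {"d3mIndex": "1", "image": "img3.png", "label": "dog"},  # violation: image differs
]

violations = multi_index_violations(rows, columns)
print(violations)
```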
  108. Of special note is the "refersTo" field, which allows linking of tables to other entities. A typical "refersTo" entry looks like the following.
  109. ```
  110. "colIndex":1,
  111. "colName":"customerID",
  112. "colType":"integer",
  113. "role":["attribute"],
  114. "refersTo":{
  115. "resID":"0",
  116. "resObject":{
  117. "columnName":"custID"}}
  118. ```
  119. Here, the customerID column in one table is linking to another resource whose "resID"="0", which is presumably a different table. Further, the object of reference, "resObject", is a column whose "columnName"="custID".
  120. A table entry can also refer to other kinds of resources besides other tables, including a raw file in a collection of raw files (e.g., img1.png in a collection of images), a node in a graph, or an edge in a graph. The table below lists the different resource objects that can be referenced.
  121. | resObject | resource referenced | Description |
  122. |-----------|---------------------|-------------|
  123. | columnIndex, columnName | table | Entries in this column refer to entries in another table (foreign key). See [example](#case-3) |
  124. | item | a collection of raw files | Each entry in this column refers to an item in a collection of raw files. See [example](#case-2) |
  125. | nodeAttribute | graph | Entries in this column refer to attribute values of a node in a graph. This can include a reference to the nodeID|
  126. | edgeAttribute | graph | Entries in this column refer to attribute values of an edge in a graph. This can include a reference to the edgeID|
  127. __Notes regarding graph node/edge roles__
  128. The parent roles of edgeSource/edgeTarget can be combined with derived roles such as directedEdgeSource/directedEdgeTarget and undirectedEdgeSource/undirectedEdgeTarget to indicate directed or undirected edges in a graph. Similarly, the metadata can capture information about simple/multi-edges using multiEdgeSource/multiEdgeTarget and simpleEdgeSource/simpleEdgeTarget.
  129. Next, we will see how this approach of treating everything as a resource and using references to link resources can handle a range of cases that can arise.
  130. ## Case 1: Single table
  131. In many OpenML and other tabular cases, all the learning data is contained in a single tabular file. In this case, an example dataset will look like the following.
  132. ```
  133. <dataset_id>/
  134. |-- tables/
  135. |-- learningData.csv
  136. d3mIndex,sepalLength,sepalWidth,petalLength,petalWidth,species
  137. 0,5.2,3.5,1.4,0.2,I.setosa
  138. 1,4.9,3.0,1.4,0.2,I.setosa
  139. 2,4.7,3.2,1.3,0.2,I.setosa
  140. 3,4.6,3.1,1.5,0.2,I.setosa
  141. 4,5.0,3.6,1.4,0.3,I.setosa
  142. 5,5.4,3.5,1.7,0.4,I.setosa
  143. ...
  144. |-- datasetDoc.json
  145. ```
  146. The [datasetDoc.json](examples/iris.datasetDoc.json) for this example shows how this case is handled. Note that in this datasetDoc.json, the "role" of the column "species" is "suggestedTarget". The reason for this is given [here](FAQ.md#suggested-targets).
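
As a sketch of how column roles might be consumed for a single-table dataset, the code below splits rows into attribute values and suggested-target values. The column annotations are assumptions made for illustration (only the "suggestedTarget" role of "species" is stated above; the linked datasetDoc.json is not reproduced here), and the helper names are hypothetical.

```python
import csv
import io

# First rows of the learningData.csv shown above.
LEARNING_DATA = """d3mIndex,sepalLength,sepalWidth,petalLength,petalWidth,species
0,5.2,3.5,1.4,0.2,I.setosa
1,4.9,3.0,1.4,0.2,I.setosa
"""

# Assumed annotations; only the role of "species" is confirmed by the text.
columns = [
    {"colIndex": 0, "colName": "d3mIndex", "role": ["index"]},
    {"colIndex": 1, "colName": "sepalLength", "role": ["attribute"]},
    {"colIndex": 2, "colName": "sepalWidth", "role": ["attribute"]},
    {"colIndex": 3, "colName": "petalLength", "role": ["attribute"]},
    {"colIndex": 4, "colName": "petalWidth", "role": ["attribute"]},
    {"colIndex": 5, "colName": "species", "role": ["suggestedTarget"]},
]


def names_with_role(columns, role):
    """Return the names of all columns that carry the given role."""
    return [col["colName"] for col in columns if role in col["role"]]


def split_rows(text, columns):
    """Split CSV rows into attribute values and suggested-target values."""
    attributes = names_with_role(columns, "attribute")
    targets = names_with_role(columns, "suggestedTarget")
    features, labels = [], []
    for row in csv.DictReader(io.StringIO(text)):
        features.append([row[name] for name in attributes])
        labels.append([row[name] for name in targets])
    return features, labels


features, labels = split_rows(LEARNING_DATA, columns)
print(features)
print(labels)
```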
  147. ## Case 2: Single table, multiple raw files <a name="case-2"></a>
  148. Consider the following sample image classification dataset. It has one learningData.csv file, whose "image" column has pointers to image files.
  149. ```
  150. <dataset_id>/
  151. |-- media/
  152. |-- img1.png
  153. |-- img2.png
  154. |-- img3.png
  155. |-- img4.png
  156. |-- tables/
  157. |-- learningData.csv
  158. d3mIndex,image,label
  159. 0,img1.png,cat
  160. 1,img2.png,dog
  161. 2,img3.png,dog
  162. 3,img4.png,cat
  163. |-- datasetDoc.json
  164. ```
  165. The [datasetDoc.json](examples/image.datasetDoc.json) for this example shows how this case is handled. Two things to note in this datasetDoc.json are:
  166. - We define all the images in media/ as a single resource, albeit a collective resource (notice "isCollection"=true).
  167. ```
  168. {
  169. "resID": "0",
  170. "resPath": "media/",
  171. "resType": "image",
  172. "resFormat": {"image/png": ["png"]},
  173. "isCollection": true
  174. }
  175. ```
  176. - We reference this image-collection resource ("resID"="0") in the "image" column of the "learningData.csv" resource:
  177. ```
  178. {
  179. "colIndex": 1,
  180. "colName": "image",
  181. "colType": "string",
  182. "role": ["attribute"],
  183. "refersTo":{
  184. "resID": "0",
  185. "resObject": "item"
  186. }
  187. }
  188. ```
  189. The semantics here are as follows: each entry in the "image" column refers to an item ("resObject"="item") in a resource whose "resID"="0". Therefore, an entry 'img1.png' in this column refers to an item of the same name in the image collection. Since we know the relative path of this image-collection resource ("resPath"="media/"), we can locate "img1.png" at <dataset_id>/media/img1.png.
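
The lookup described above can be sketched in a few lines: find the resource named by "refersTo", then join the dataset root, the resource's "resPath", and the cell value. The function name and the trimmed resource list are hypothetical, and `<dataset_id>` is kept as a literal placeholder.

```python
import posixpath


def resolve_item_path(dataset_root, resources, column, entry):
    """Locate the file that a cell value points to in a collection resource."""
    reference = column["refersTo"]
    if reference["resObject"] != "item":
        raise ValueError("column does not refer to an item in a collection")
    target = next(res for res in resources if res["resID"] == reference["resID"])
    return posixpath.join(dataset_root, target["resPath"], entry)


# Trimmed versions of the annotations shown above.
resources = [
    {"resID": "0", "resPath": "media/", "resType": "image", "isCollection": True},
]
image_column = {
    "colIndex": 1,
    "colName": "image",
    "refersTo": {"resID": "0", "resObject": "item"},
}

print(resolve_item_path("<dataset_id>", resources, image_column, "img1.png"))
```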
  190. ## Case 3: Multiple tables referencing each other <a name="case-3"></a>
  191. Consider the following dataset containing multiple relational tables referencing each other.
  192. ```
  193. <dataset_id>/
  194. |-- tables/
  195. |-- customers.csv
  196. custID,country,first_invoices_time,facebookHandle
  197. 0,USA,12161998,"johnDoe"
  198. 1,USA,06272014,"janeSmith"
  199. 2,USA,12042014,"fooBar"
  200. 3,AUS,02022006,"johnyAppleseed"
  201. ...
  202. |-- items.csv
  203. stockCode,first_item_purchases_time,Description
  204. FIL36,02162005,waterfilter
  205. MAN42,06272014,userManual
  206. GHDWR2,01112012,generalHardware
  207. ...
  208. |-- invoices.csv
  209. invoiceNo,customerID,first_item_purchases_time
  210. 0,734,12161998
  211. 1,474,11222010
  212. 2,647,10182011
  213. ...
  214. |-- learningData.csv
  215. item_purchase_ID,invoiceNo,invoiceDate,stockCode,unitPrice,quantity,customerSatisfied,profitMargin
  216. 0,36,03142009,GYXR15,35.99,2,1,0.1
  217. 1,156,02022006,TEYT746,141.36,2,0,0.36
  218. 2,8383,11162010,IUYU132,57.25,2,0,0.22
  219. 3,3663,12042014,FURY145,1338.00,1,1,0.11
  220. 4,6625,07072013,DSDV97,762.00,1,1,0.05
  221. ...
  222. ```
  223. Here, "learningData" references both the "invoices" and "items" tables. The "invoices" table in turn references the "customers" table. The [datasetDoc.json](examples/multitable.datasetDoc.json) for this example shows how this case is handled.
  224. There are a number of important things to notice in this example:
  225. - The "isCollection" flag is false for all the resources, indicating that there is no collection-of-files resource, unlike in the image example above.
  226. - As shown in the snippet below, columns in one resource are referencing columns in another resource.
  227. ```
  228. -- In "resID"="1" ---         -- In "resID"="3" ---         -- In "resID"="3" ---
  229. "colIndex":1,                 "colIndex":1,                 "colIndex":3,
  230. "colName":"customerID",       "colName":"invoiceNo",        "colName":"stockCode",
  231. "colType":"integer",          "colType":"integer",          "colType":"string",
  232. "role":["attribute"],         "role":["attribute"],         "role":["attribute"],
  233. "refersTo":{                  "refersTo":{                  "refersTo":{
  234. "resID":"0",                  "resID":"1",                  "resID":"2",
  235. "resObject":{                 "resObject":{                 "resObject":{
  236. "columnName":"custID"}}       "columnIndex":0}}             "columnName":"stockCode"}}
  237. ```
  238. - A column can have multiple roles. For instance, in "customers.csv", the column "facebookHandle" has two roles ("role":["attribute", "key"]). It is a key because it satisfies the uniqueness constraint. (See [here](FAQ.md#index-key) for the distinction between index and key roles.)
  239. ```
  240. {
  241. "colIndex":3,
  242. "colName":"facebookHandle",
  243. "colType":"string",
  244. "role":["attribute", "key"]
  245. }
  246. ```
  247. - Finally, a column can be referenced using its columnIndex or columnName. See the "resID"="3" entries above. Ideally, references use column names for readability; however, we have occasionally come across datasets with tables that do not have unique column names. In such cases, columnIndex is used.
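
A minimal sketch of resolving such a reference, handling both the columnName and the columnIndex form. The resource list is a trimmed, partly assumed version of this example (the full datasetDoc.json is not reproduced here), and `resolve_column` is a hypothetical helper name.

```python
def resolve_column(resources, refers_to):
    """Return (resource, column annotation) that a "refersTo" entry points to."""
    target = next(res for res in resources if res["resID"] == refers_to["resID"])
    res_object = refers_to["resObject"]
    for col in target["columns"]:
        if "columnName" in res_object and col["colName"] == res_object["columnName"]:
            return target, col
        if "columnIndex" in res_object and col["colIndex"] == res_object["columnIndex"]:
            return target, col
    raise LookupError("referenced column not found")


# Trimmed, assumed annotations for two of the tables in this example.
resources = [
    {"resID": "1", "resPath": "tables/invoices.csv", "columns": [
        {"colIndex": 0, "colName": "invoiceNo"},
        {"colIndex": 1, "colName": "customerID"},
    ]},
    {"resID": "2", "resPath": "tables/items.csv", "columns": [
        {"colIndex": 0, "colName": "stockCode"},
    ]},
]

# Reference by index (learningData.invoiceNo) and by name (learningData.stockCode).
by_index = resolve_column(resources, {"resID": "1", "resObject": {"columnIndex": 0}})
by_name = resolve_column(resources, {"resID": "2", "resObject": {"columnName": "stockCode"}})
print(by_index[0]["resPath"], by_index[1]["colName"])
print(by_name[0]["resPath"], by_name[1]["colName"])
```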
  248. ## Case 4: Referencing Graph Elements <a name="case-4"></a>
  249. Assume a sample graph resource below, G2.gml, whose 'resID' = 1
  250. ```
  251. graph [
  252. node [
  253. id 0
  254. label "1"
  255. nodeID 1
  256. ]
  257. node [
  258. id 1
  259. label "88160"
  260. nodeID 88160
  261. ]
  262. ...
  263. ```
  264. A corresponding learningData.csv
  265. | d3mIndex | G2.nodeID | class |
  266. |----------|-----------|-------|
  267. |0 |88160 |1 |
  268. |... |... |... |
  269. A corresponding datasetDoc.json
  270. ```
  271. {
  272. "colIndex": 1,
  273. "colName": "G2.nodeID",
  274. "colType": "integer",
  275. "role": [
  276. "attribute"
  277. ],
  278. "refersTo": {
  279. "resID": "1",
  280. "resObject": "node"
  281. }
  282. },
  283. ```
  284. - "resID": "1" - this points to the graph resource G2.gml
  285. - "resObject": "node" - this points to a node in that graph. **All nodes have a 'nodeID' in the GML file, and each node is uniquely identified and referenced by its nodeID.**
  286. - So, in the learningData table above, row 0 points to a node in G2.gml whose nodeID = 88160:
  287. ```
  288. node [
  289. id 1
  290. label "88160"
  291. nodeID 88160
  292. ]
  293. ```
  294. **Note:** nodeIDs can also be strings, as shown in the example below:
  295. ```
  296. graph.gml (resID=1)                 learningData.csv                                      datasetDoc.json
  297. ===================                 =================                                     ================
  298. graph [                             | d3mIndex | child.nodeID | parent.nodeID | bond |    ...
  299. node [                              |----------|--------------|---------------|------|    {
  300. id 0                                | 0        | BSP          | OPSIN         | C    |    "colIndex": 1,
  301. label "Protein Acyl Transferases"   | 1        | RHOD         | OPSIN         | NC   |    "colName": "child.nodeID",
  302. nodeID "PATS"                       | 2        | RXN          | GPCR          | NC   |    "colType": "string",
  303. ]                                   | ...      | ...          | .....         | ...  |    "role": [
  304. node [                                                                                    "attribute"
  305. id 1                                                                                      ],
  306. label "Blue Sensitive Opsins"                                                             "refersTo": {
  307. nodeID "BSoP"                                                                             "resID": "1",
  308. ]                                                                                         "resObject": "node"
  309. ...                                                                                       }},
  310.                                                                                           ...
  311. ```
  312. ## Case 5: Graph resources represented as edge lists <a name="case-5"></a>
  313. In v3.1.1, we introduced a new "resType" called "edgeList". This is how a dataset might look when the graphs are represented using edge lists.
  314. ```
  315. <dataset_id>/
  316. |-- tables/
  317. |-- learningData.csv
  318. d3mIndex,user_ID,user_location,gender,age,high_valued_user
  319. 0,2048151474,46614,M,37,"johnDoe",1
  320. 1,6451671537,08105,F,55,"janeSmith",1
  321. 2,6445265804,31021,F,23,"fooBar",1
  322. 3,0789614390,11554,58,"johnyAppleseed",1
  323. ...
  324. |-- graphs
  325. |-- mentions.edgelist.csv
  326. edgeID,user_ID,mentioned_userID
  327. 0,2048151474,0688363262
  328. 1,2048151474,1184169266
  329. 2,2048151474,3949024994
  330. 3,2048151474,7156662506
  331. ...
  332. ```
  333. Its corresponding datasetDoc.json will look like the following:
  334. ```
  335. ...
  336. {
  337. "resID": "0",
  338. "resPath": "graphs/mentions.edgelist.csv",
  339. "resType": "edgeList",
  340. "resFormat": {"text/csv": ["csv"]},
  341. "isCollection": false,
  342. "columns": [
  343. {
  344. "colIndex": 0,
  345. "colName": "edgeID",
  346. "colType": "integer",
  347. "role": [
  348. "index"
  349. ]
  350. },
  351. {
  352. "colIndex": 1,
  353. "colName": "user_ID",
  354. "colType": "integer",
  355. "role": [
  356. "attribute"
  357. ],
  358. "refersTo": {
  359. "resID": "learningData",
  360. "resObject":{
  361. "columnName":"user_ID"}
  362. }
  363. },
  364. {
  365. "colIndex": 2,
  366. "colName": "mentioned_userID",
  367. "colType": "integer",
  368. "role": [
  369. "attribute"
  370. ],
  371. "refersTo": {
  372. "resID": "learningData",
  373. "resObject":{
  374. "columnName":"user_ID"}
  375. }
  376. }
  377. ]
  378. },
  379. {
  380. "resID": "learningData",
  381. "resPath": "tables/learningData.csv",
  382. "resType": "table",
  383. "resFormat": {"text/csv": ["csv"]},
  384. "isCollection": false,
  385. "columns": [
  386. {
  387. "colIndex": 0,
  388. "colName": "d3mIndex",
  389. "colType": "integer",
  390. "role": [
  391. "index"
  392. ]
  393. },
  394. {
  395. "colIndex": 1,
  396. "colName": "user_ID",
  397. "colType": "integer",
  398. "role": [
  399. "attribute"
  400. ]
  401. },
  402. {
  403. "colIndex": 2,
  404. "colName": "user_location",
  405. "colType": "categorical",
  406. "role": [
  407. "attribute"
  408. ]
  409. },
  410. ...
  411. },
  412. ...
  413. ```
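
For illustration, an edge list like mentions.edgelist.csv can be turned into an adjacency mapping with the standard `csv` module. This is only a sketch: the function name is hypothetical, it treats edges as directed from `user_ID` to `mentioned_userID` (an assumption, since the roles above are plain "attribute"), and it keeps IDs as strings so that leading zeros survive.

```python
import csv
import io
from collections import defaultdict

# The rows of mentions.edgelist.csv shown above.
EDGE_LIST_CSV = """edgeID,user_ID,mentioned_userID
0,2048151474,0688363262
1,2048151474,1184169266
2,2048151474,3949024994
3,2048151474,7156662506
"""


def adjacency_from_edge_list(text, source_col, target_col):
    """Map each source node to the list of target nodes it has edges to."""
    adjacency = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text)):
        adjacency[row[source_col]].append(row[target_col])
    return dict(adjacency)


adjacency = adjacency_from_edge_list(EDGE_LIST_CSV, "user_ID", "mentioned_userID")
print(adjacency)
```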
  414. # Qualities
  415. A dataset can be annotated with its unique properties or __qualities__. The "qualities" section of the datasetSchema specifies the fields that can be used to capture those qualities.
  416. | Field | Description |
  417. |-------|-------------|
  418. | qualName | name of the quality variable |
  419. | qualValue | value of the quality variable |
  420. | qualValueType | data type of the quality variable; can be one of ["boolean","integer","real","string","dict"]|
  421. | qualValueUnits | units of the quality variable |
  422. | restrictedTo | if a quality is restricted to specific resource and not the entire dataset, this field captures that constraint |
  423. __Note:__ We are not providing a taxonomy of qualities but, through this section in the dataset schema, are providing a provision for capturing the qualities of a dataset in its datasetDoc.json file.
  424. Some qualities apply to an entire dataset. For instance, one may want to annotate a dataset with (1) whether it contains multi-relational tables, (2) whether it requires LUPI (Learning Using Privileged Information), and/or (3) its approximate size:
  425. ```
  426. "qualities":[
  427. {
  428. "qualName":"multiRelationalTables",
  429. "qualValue":true,
  430. "qualValueType":"boolean"
  431. },
  432. {
  433. "qualName":"LUPI",
  434. "qualValue":true,
  435. "qualValueType":"boolean"
  436. },
  437. {
  438. "qualName":"approximateSize",
  439. "qualValue":300,
  440. "qualValueType":"integer",
  441. "qualValueUnits":"MB"
  442. }]
  443. ```
  444. Some qualities are applicable only to certain resources within a dataset. For instance, "maxSkewOfNumericAttributes" is table specific and not dataset wide. In such cases, we include the "restrictedTo" field:
  445. ```
  446. {
  447. "qualName":"maxSkewOfNumericAttributes",
  448. "qualValue":6.57,
  449. "qualValueType":"real",
  450. "restrictedTo":{
  451. "resID":"3" // resID 3 is a table
  452. }
  453. },
  454. ```
  455. Further, some qualities are restricted to not only a specific resource but to a specific component within a resource. For instance, "classEntropy" and "numMissingValues" are only applicable to a column within a table within a dataset. In such cases, the "restrictedTo" field will have additional "resComponent" qualifier:
  456. ```
  457. {
  458. "qualName":"classEntropy",
  459. "qualValue":3.63,
  460. "qualValueType":"real",
  461. "restrictedTo":{
  462. "resID":"3",
  463. "resComponent":{"columnName":"customerSatisfied"}
  464. }
  465. },
  466. {
  467. "qualName":"numMissingValues",
  468. "qualValue":12,
  469. "qualValueType":"integer",
  470. "restrictedTo":{
  471. "resID":"3",
  472. "resComponent":{"columnName":"customerSatisfied"}
  473. }
  474. }
  475. ```
  476. Similarly, a "privilegedFeature" called "facebookHandle" will be restricted to its column within a table (resID=0, customers table) within a dataset, as shown below:
  477. ```
  478. {
  479. "qualName":"privilegedFeature",
  480. "qualValue":true,
  481. "qualValueType":"boolean",
  482. "restrictedTo":{
  483. "resID":"0",
  484. "resComponent":{"columnName":"facebookHandle"}
  485. }
  486. }
  487. ```
  488. For a full example see the [datasetDoc.json](examples/multitable.datasetDoc.json) file for the above Case 3.
  489. "resComponent" can also be the string "nodes" or "edges".
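
The three levels of scope described in this section (whole dataset, one resource, one component of a resource) can be told apart by inspecting "restrictedTo". The sketch below does that for entries taken from the examples above; `quality_scope` is a hypothetical helper, not part of the schema.

```python
def quality_scope(quality):
    """Classify a quality as dataset-wide, resource-level, or component-level."""
    restricted = quality.get("restrictedTo")
    if restricted is None:
        return ("dataset", None, None)
    component = restricted.get("resComponent")
    if component is None:
        return ("resource", restricted["resID"], None)
    return ("component", restricted["resID"], component)


# Entries taken from the examples in this section.
qualities = [
    {"qualName": "multiRelationalTables", "qualValue": True,
     "qualValueType": "boolean"},
    {"qualName": "maxSkewOfNumericAttributes", "qualValue": 6.57,
     "qualValueType": "real", "restrictedTo": {"resID": "3"}},
    {"qualName": "classEntropy", "qualValue": 3.63, "qualValueType": "real",
     "restrictedTo": {"resID": "3",
                      "resComponent": {"columnName": "customerSatisfied"}}},
]

for quality in qualities:
    print(quality["qualName"], quality_scope(quality))
```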
  490. # FAQ
  491. #### Why do we have distinct "index" and "key" roles?
  492. A field which has unique values is a key. However, a key that is used to uniquely identify a row in a table is an index. There can be multiple keys in a table, but only a single index. Both index and key columns can participate in the foreign key relationship with other tables.
  493. #### Why is the "role" field a list?
  494. A column can play multiple roles. For instance, it can be a "locationIndicator" and an "attribute", a "key" and an "attribute", and so on.
  495. #### How are LUPI datasets represented?
  496. LUPI datasets contain columns for variables which are not available during testing (the testing dataset split does not contain data for those columns).
  497. Such a dataset can have some columns marked with the role "suggestedPrivilegedData" to suggest that those columns might be used in the problem description as LUPI columns. The problem description specifies such columns through its "privilegedData" list of columns.
  498. #### Why are we using "suggestedTarget" role instead of "target" in datasetSchema?
  499. Defining targets is inherently in the domain of a problem. The best we can do in a dataset is to make a suggestion to the problem that these are potential targets. Therefore we use "target" in the problemSchema vocabulary and "suggestedTarget" in the datasetSchema vocabulary.
  500. #### What is the use of the role 'boundaryIndicator'?
  501. In some datasets, there are columns that denote boundaries and are best not treated as attributes for learning. In such cases, this tag comes in handy. Example:
  502. ```
  503. d3mIndex,audio_file,startTime,endTime,event
  504. 0,a_001.wav,0.32,0.56,car_horn
  505. 1,a_002.wav,0.11,1.34,engine_start
  506. 2,a_003.wav,0.23,1.44,dog_bark
  507. 3,a_004.wav,0.34,2.56,dog_bark
  508. 4,a_005.wav,0.56,1.22,car_horn
  509. ...
  510. ```
  511. Here, 'startTime' and 'endTime' indicate the start and end of an event of interest in the audio file. These columns are good candidates for the role of boundary indicators.
  512. A boundary column can contain a `refersTo` reference that points to the column for which it is a boundary.
  513. If `refersTo` is not provided, the boundary column is a boundary for the first preceding non-boundary column.
  514. E.g., in the example above, `startTime` and `endTime` columns are boundary columns for `audio_file` column
  515. (or more precisely, for files pointed to by the `audio_file` column).
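
The default rule (no `refersTo` given) can be sketched as a backwards scan over the column annotations. The annotations below are assumptions made for the audio example above, the helper name is hypothetical, and the explicit `refersTo` case is deliberately not handled.

```python
def default_boundary_target(columns, position):
    """Return the name of the first preceding non-boundary column, or None.

    `columns` must be ordered by colIndex; `position` is the list index
    of the boundary column.
    """
    for candidate in reversed(columns[:position]):
        if "boundaryIndicator" not in candidate["role"]:
            return candidate["colName"]
    return None


# Assumed annotations for the audio example above.
columns = [
    {"colName": "d3mIndex", "role": ["index"]},
    {"colName": "audio_file", "role": ["attribute"]},
    {"colName": "startTime", "role": ["boundaryIndicator"]},
    {"colName": "endTime", "role": ["boundaryIndicator"]},
    {"colName": "event", "role": ["suggestedTarget"]},
]

print(default_boundary_target(columns, 2))  # target of startTime
print(default_boundary_target(columns, 3))  # target of endTime
```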

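A full-stack automated machine learning system, aimed primarily at anomaly detection on multivariate time-series data. TODS provides comprehensive modules for building machine-learning-based anomaly detection systems, including: data processing, time series processing, feature analysis, detection algorithms, and a reinforcement module. The functionality these modules provide includes common data preprocessing, smoothing or transformation of time-series data, feature extraction from the time or frequency domain, and a variety of detection algorithms ...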