
# Problem Schema (version 4.0.0)
The problem schema provides a specification of an abstract data science problem. It is contained in the [problemSchema.json](../schemas/problemSchema.json) file. An instance of this schema is included with every problem in the problemDoc.json file.
The problem schema specifies a problem in three sections: about, inputs, and expectedOutputs. Each of these sections is described below.
# About
The "about" section contains some general information about the problem and consists of the following fields.
| Field | Description |
|-----------------------|---------------------------------------------------------------------------------------------------|
| problemID | a unique ID assigned to a problem |
| problemName | the name of a problem |
| problemDescription | a brief description of the problem |
| problemURI | the location of the problem |
| taskKeywords | a list of keywords that capture the nature of the machine learning task |
| problemVersion | the version of the current problem |
| problemSchemaVersion | the version of the problem schema |
Currently, the keywords that can be combined to describe the task are the following:
| taskKeywords |Notes|
|--------------|-----|
|classification|supervised learning task - learn from a labeled dataset to assign class labels to prediction samples|
|regression|supervised learning task - learn from a labeled dataset to assign numeric values to prediction samples|
|clustering|unsupervised learning task - no labeled dataset; cluster samples and assign a cluster label to each sample|
|linkPrediction|[linkPrediction](#link-prediction) task|
|vertexNomination|[vertexNomination](#vertex-nomination) task|
|vertexClassification|[vertexClassification](#vertex-classification) task|
|communityDetection|[communityDetection](#community-detection) task|
|graphMatching|[graphMatching](#graph-matching) task|
|forecasting|data is indexed by a time dimension and the task is to predict future values based on previously observed values|
|collaborativeFiltering|task of filling blank cells in a utility matrix, where each cell holds an association between two entities (e.g., users and products)|
|objectDetection|[objectDetection](#object-detection-task) task|
|semiSupervised|[semiSupervised](#semi-supervised) learning task|
|unsupervised|unsupervised learning task - no labeled dataset|
|binary|binary classification task|
|multiClass|multi-class classification task|
|multiLabel|multi-label classification task|
|univariate|applied to a "regression" task with a single response variable|
|multivariate|applied to a "regression" task with more than one response variable|
|overlapping|applied to "communityDetection" problems to indicate overlapping communities: multiple community memberships per node|
|nonOverlapping|applied to "communityDetection" problems to indicate disjoint communities: a single community membership per node|
|tabular|indicates data is tabular|
|relational|indicates data is a relational database|
|nested|indicates that a table consists of nested tables. For example, a column entry can point to an entire table stored in a separate CSV file (similar to pointing to an image or other media files)|
|image|indicates data consists of raw images|
|audio|indicates data consists of raw audio|
|video|indicates data consists of raw video|
|speech|indicates human speech data|
|text|indicates data consists of raw text|
|graph|indicates data consists of graphs|
|multiGraph|indicates data consists of multigraphs|
|timeSeries|indicates data consists of time series|
|grouped|applied to time series data (or tabular data in general) to indicate that some columns should be [grouped](#grouped-time-series)|
|geospatial|indicates data contains geospatial information|
|remoteSensing|indicates data contains remote-sensing data|
|lupi|indicates the presence of privileged features: [lupi](#LUPI)|
|missingMetadata|indicates that the metadata for the dataset is not complete|
# Inputs
This section specifies the three inputs that are required to understand and solve a problem in the context of D3M. They include: data, data splits, and performance metrics.
```
"inputs": {
  "data": [
    ...
  ],
  "dataSplits": {
    ...
  },
  "performanceMetrics": {
    ...
  }
}
```
## Data
Data refers to the dataset(s) over which a problem is defined. A problem can refer to multiple datasets; that is captured by the "data" field, which is a list of dataset references.
```
"data": [
  {
    "datasetID": "sample_dataset_ID",
    ...
  }
]
```
<a name="target-index"></a>
In addition to the datasetID, one must also specify the target variable(s) on that dataset as part of the problem specification. Each target variable is specified by referring to a table using its "resID" and specifying its target column index and column name. For a correct specification, the target column names and indexes in problemDoc must match the corresponding column names and indexes in datasetDoc. The redundancy of specifying a target by both its column name and index is by design.
```
"data": [
  {
    "datasetID": "sample_dataset_ID",
    "targets": [
      {
        "targetIndex": 0,
        "resID": "0", // reference to a table in the dataset
        "colIndex": 18,
        "colName": "classLabel"
      }
    ]
  }
]
```
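Because each target carries both a column name and a column index, the two documents can be cross-checked. The following is a minimal sketch of such a consistency check; the helper name and the inline documents are illustrative, not part of the schema:

```python
# Hypothetical in-memory documents; in practice these would be loaded with
# json.load() from problemDoc.json and datasetDoc.json.
problem_doc = {
    "inputs": {"data": [{
        "datasetID": "sample_dataset_ID",
        "targets": [{"targetIndex": 0, "resID": "0",
                     "colIndex": 18, "colName": "classLabel"}],
    }]}
}
dataset_doc = {
    "dataResources": [{
        "resID": "0",
        "columns": [{"colIndex": 18, "colName": "classLabel"}],
    }]
}

def targets_consistent(problem_doc, dataset_doc):
    """Check that each target's (colIndex, colName) pair in problemDoc
    matches a column of the referenced resource in datasetDoc."""
    resources = {r["resID"]: r for r in dataset_doc["dataResources"]}
    for data in problem_doc["inputs"]["data"]:
        for target in data.get("targets", []):
            columns = resources[target["resID"]].get("columns", [])
            if not any(c["colIndex"] == target["colIndex"]
                       and c["colName"] == target["colName"]
                       for c in columns):
                return False
    return True

print(targets_consistent(problem_doc, dataset_doc))  # True
```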
For more information about "resID", "colIndex", and "colName", refer to [datasetSchema.json](../schemas/datasetSchema.json).
"targets" also has an optional "numClusters" field, applicable to clustering problems. It is used to specify the number of clusters to be generated by the solution algorithm (if this information is known a priori).
<a name="forecasting-horizon"></a>
If the task is time series forecasting, the problem specification can contain additional information about the forecasting horizon: a number indicating the maximum number of future time steps for which predictions will need to be made. The horizon is in units of the `timeGranularity` in the column metadata. In the following example, assuming that the `timeGranularity` is 5 days, the prediction horizon is 10 future steps, i.e., 50 days:
```
"forecastingHorizon": {
  "resID": "learningData",
  "colIndex": 8,
  "colName": "time",
  "horizonValue": 10.0
}
```
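The arithmetic from the example above can be sketched directly; the `timeGranularity` of 5 days is the assumption stated in the text:

```python
from datetime import timedelta

# Assumed column metadata from datasetDoc.json: timeGranularity of 5 days.
time_granularity = timedelta(days=5)
horizon_value = 10.0  # "horizonValue" from the forecastingHorizon example

# The forecast extends horizonValue * timeGranularity past the last observation.
forecast_span = horizon_value * time_granularity
print(forecast_span.days)  # 50
```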
"data" can also contain a "privilegedData" list of columns corresponding to variables that are unavailable during testing. Those columns have no data in the test split of a dataset.
```
"data": [
  {
    "datasetID": "sample_dataset_ID",
    "privilegedData": [
      {
        "privilegedDataIndex": 0,
        "resID": "learningData",
        "colIndex": 20,
        "colName": "HISTOLOGY"
      }
    ]
  }
]
```
## Data Splits
Every problem has a special __dataSplits__ file. This file contains information about which rows in the learning data are 'TRAIN' rows and which ones are 'TEST' rows. It has the following columns: [d3mIndex, type, repeat, fold], as shown in the problem sample below.
![](static/sampleProblem.PNG)
This split file indirectly reflects the evaluation procedure that was used to define the problem. The "dataSplits" section in the problem schema contains information that can be used to infer the evaluation procedure and interpret the dataSplits file. It contains the following fields:
| Field | Description |
|-------|-------------|
| method | refers to the evaluation method reflected in the data splits. Currently, it can be one of "holdOut" or "kFold". Can be extended in future to accommodate others. It is always 'holdOut' by default |
| testSize | applicable to the "holdOut" method, it specifies the size of the test split in the range 0.0 to 1.0 |
| numFolds | applicable to the "kFold" method, it specifies the number of folds |
| stratified | specifies if the split is stratified or not, the default value being True |
| numRepeats | specifies how many repeats, e.g., 3 repeats of 50% holdout |
| randomSeed | the random number generator seed that was used to generate the splits |
| splitsFile | the relative path to the splits file from the problem root, which is "dataSplits.csv" directly under the problem root by default |
| splitScript | the relative path from the problem root to the script that was used to create the data split (optional) |
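The dataSplits file can be applied to the learning data by joining on `d3mIndex`. A minimal sketch using only the standard library (the inline CSV contents are illustrative):

```python
import csv
import io

# Illustrative file contents; in practice these would be opened from
# dataSplits.csv and tables/learningData.csv.
data_splits_csv = io.StringIO(
    "d3mIndex,type,repeat,fold\n"
    "0,TRAIN,0,0\n"
    "1,TRAIN,0,0\n"
    "2,TEST,0,0\n")
learning_data_csv = io.StringIO(
    "d3mIndex,feature,classLabel\n"
    "0,1.2,a\n"
    "1,3.4,b\n"
    "2,5.6,a\n")

# Map each d3mIndex to its split type, then partition the learning data.
split_of = {row["d3mIndex"]: row["type"]
            for row in csv.DictReader(data_splits_csv)}
rows = list(csv.DictReader(learning_data_csv))
train = [r for r in rows if split_of[r["d3mIndex"]] == "TRAIN"]
test = [r for r in rows if split_of[r["d3mIndex"]] == "TEST"]
print(len(train), len(test))  # 2 1
```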
## Performance metrics
Another important part of a problem specification is the set of metrics that will be used to evaluate the performance of a solution. The "performanceMetrics" section of the problem schema is a **list** of metrics. The provided metrics are the following.
| Metric | Notes |
|--------|-------|
| accuracy | sklearn.metrics.accuracy_score |
| precision | sklearn.metrics.precision_score |
| recall | sklearn.metrics.recall_score |
| f1 | sklearn.metrics.f1_score (pos_label=1) |
| f1Micro | sklearn.metrics.f1_score(average='micro') |
| f1Macro | sklearn.metrics.f1_score(average='macro') |
| rocAuc | Compute Area Under the Receiver Operating Characteristic Curve (ROC AUC) - only works with binary data, requires confidence |
| rocAucMacro | Compute the rocAuc metric for each label and compute the unweighted mean - only works for multi-class and multi-label data, requires confidence |
| rocAucMicro | Compute the rocAuc metric globally by considering each element of the label indicator matrix as a binary prediction - only works for multi-class and multi-label data, requires confidence |
| meanSquaredError | sklearn.metrics.mean_squared_error, average computed over multiple target columns |
| rootMeanSquaredError | sqrt(sklearn.metrics.mean_squared_error), average computed over multiple target columns |
| meanAbsoluteError | sklearn.metrics.mean_absolute_error, average computed over multiple target columns |
| rSquared | sklearn.metrics.r2_score |
| normalizedMutualInformation | sklearn.metrics.normalized_mutual_info_score |
| jaccardSimilarityScore | sklearn.metrics.jaccard_similarity_score |
| precisionAtTopK | the number of values shared between the first K entries of the ground truth and of the predicted labels, normalized by K |
| objectDetectionAP | an implementation of mean average precision for object detection problems; the mean is computed across classes (when there are multiple classes) and the average across bounding polygons of one class; requires confidence |
| hammingLoss | wraps the sklearn.metrics.hamming_loss function - used for multilabel classification problems |
| meanReciprocalRank | computes the mean of the reciprocals of the elements of a vector of rankings - used for linkPrediction problems |
| hitsAtK | computes how many elements of a vector of ranks make it to the top 'k' positions - used for linkPrediction problems |
**Note**: There can be multiple performance metrics included for a single problem. For example:
```
"performanceMetrics": [
  {
    "metric": "accuracy"
  },
  {
    "metric": "f1"
  }
]
```
### Optional parameters for metrics
Notice that there are optional metric parameters in the problemSchema:
```
...
"K":{"type":"integer","required":false,"dependencies": {"metric":["precisionAtTopK","hitsAtK"]}},
"posLabel":{"type":"string","required":false, "dependencies": {"metric":["f1","precision","recall"]}}
...
```
| parameter | Notes |
|-----------|-------|
| K | applicable to the 'precisionAtTopK' and 'hitsAtK' metrics; provides the value of K |
| posLabel | applicable to the 'f1', 'precision', and 'recall' metrics; indicates which class should be treated as the "positive" class |
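As an illustration of how the `K` parameter is used, here is a pure-Python sketch of `precisionAtTopK` following the description in the metrics table above; the official scoring implementation may differ in detail:

```python
def precision_at_top_k(ground_truth, predictions, k):
    """precisionAtTopK as described above: the number of values shared
    between the first K entries of the ground truth and the first K
    entries of the predictions, normalized by K."""
    return len(set(ground_truth[:k]) & set(predictions[:k])) / k

# The first 3 ground-truth entries {a, b, c} and the first 3 predicted
# entries {b, e, a} share 2 values, so the score is 2/3.
score = precision_at_top_k(["a", "b", "c", "d"], ["b", "e", "a", "c"], k=3)
print(round(score, 4))  # 0.6667
```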
# Expected outputs
The "expectedOutputs" section in the problem schema directs solution systems to produce output(s) in a standardized manner.
```
"expectedOutputs":{"type":"dict","required":true,"schema":{
  "predictionsFile": {"type":"string", "required":true, "default":"predictions.csv"},
  "scoresFile": {"type":"string", "required":false, "default":"scores.csv"}
}}
```
## Predictions file
Currently, there is only one required output: the predictions file. All solution systems should output this file. By default the name of this file is "predictions.csv", but it can be changed in a problem instance using the "predictionsFile" field in the problem schema.
The structure of the predictions file depends on the metric used in the problem description.
### `objectDetectionAP` metric
The `objectDetectionAP` metric requires the following structure:
* `d3mIndex` column
* target columns (names depend on problem target columns)
* `confidence` column
Generally the target columns are: a column for the class/label and a column for the object boundary.
For the same input sample (image), multiple output rows can be made (matching in `d3mIndex` value),
but **only** rows for objects which are determined to be found in the image.
|d3mIndex|class|bounding_box |confidence|
|--------|-----|---------------|----------|
|0 |dog |"..." |... |
|0 |cat |"..." |... |
|0 |cat |"..." |... |
|0 |cat |"..." |... |
|1 |cat |"..." |... |
|1 |cat |"..." |... |
|1 |cat |"..." |... |
|1 |pig |"..." |... |
|1 |pig |"..." |... |
|2 |man |"..." |... |
|2 |man |"..." |... |
|2 |man |"..." |... |
|2 |bird |"..." |... |
|2 |car |"..." |... |
### `rocAuc`, `rocAucMacro`, `rocAucMicro` metrics
The `rocAuc`, `rocAucMacro`, and `rocAucMicro` metrics require the following structure:
* `d3mIndex` column
* target column (name depends on the problem target column)
* `confidence` column
The target column represents the class/label. For one input sample, multiple output rows should be made (matching in `d3mIndex` value),
one row for each possible class/label value. The `confidence` value should then be the confidence of classification
into that class/label.
| d3mIndex | label | confidence |
|----------|-------|------------|
| 640 | 0 | 0.612 |
| 640 | 1 | 0.188 |
| 640 | 2 | 0.2 |
| 641 | 0 | 0.4 |
| 641 | 1 | 0.25 |
| 641 | 2 | 0.35 |
| 642 | 0 | 1.0 |
| 642 | 1 | 0.0 |
| 642 | 2 | 0.0 |
| 643 | 0 | 0.52 |
| 643 | 1 | 0.38 |
| 643 | 2 | 0.1 |
| 644 | 0 | 0.3 |
| 644 | 1 | 0.2 |
| 644 | 2 | 0.5 |
| 645 | 0 | 0.1 |
| 645 | 1 | 0.2 |
| 645 | 2 | 0.7 |
| 646 | 0 | 1.0 |
| 646 | 1 | 0.0 |
| 646 | 2 | 0.0 |
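Scorers typically expect one confidence vector per sample rather than this long format, so a scoring pipeline must regroup the rows by `d3mIndex`. A minimal stdlib sketch (the inline CSV reproduces the first rows of the table above):

```python
import csv
import io

# Illustrative long-format predictions; in practice read from predictions.csv.
predictions_csv = io.StringIO(
    "d3mIndex,label,confidence\n"
    "640,0,0.612\n640,1,0.188\n640,2,0.2\n"
    "641,0,0.4\n641,1,0.25\n641,2,0.35\n")

# Regroup: one {label: confidence} mapping per d3mIndex.
confidences = {}
for row in csv.DictReader(predictions_csv):
    confidences.setdefault(row["d3mIndex"], {})[row["label"]] = float(row["confidence"])

print(confidences["640"])  # {'0': 0.612, '1': 0.188, '2': 0.2}
```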
### `hitsAtK`, `meanReciprocalRank`
The `hitsAtK` and `meanReciprocalRank` metrics require the following structure:
* `d3mIndex` column
* target column (name depends on the problem target column)
* `rank` column
`rank` is a reserved keyword. The target column represents the class/label. For one input sample, multiple output rows should be made (matching in `d3mIndex` value). Unlike `rocAuc`, this need not include all possible class/label values. Example:
learningData.csv:
|d3mIndex |subject |object |relationship (target)|
|------------|--------|------------|---------------------|
|0 |James |John |father |
|1 |John |Patricia |sister |
|2 |Robert |Thomas |brother |
|... |... |... |... |
|... |... |... |... |
ground truth (coming from learningData.csv):
|d3mIndex |relationship |
|------------|-------------|
|0 |father |
|1 |sister |
|2 |brother |
predictions.csv:
|d3mIndex |relationship |rank |
|------------|----------------|-----|
|0 |brother |1 |
|0 |cousin |2 |
|0 |mother |3 |
|0 |father |4<sup>*</sup> |
|0 |grandfather |5 |
|1 |sister |1<sup>*</sup> |
|1 |mother |2 |
|1 |aunt |3 |
|2 |father |1 |
|2 |brother |2<sup>*</sup> |
|2 |sister |3 |
|2 |grandfather |4 |
|2 |aunt |5 |
Note that the rank vector = [4,1,2]
MRR = sum(1/ranks)/len(ranks) = 0.58333
Hits@3 = 2/3 = 0.666666; Hits@1 = 1/3 = 0.3333333; Hits@5 = 3/3 = 1.0
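The arithmetic of the example above can be sketched in a few lines of pure Python; the rank vector [4, 1, 2] holds the ranks at which the ground-truth values appear in predictions.csv:

```python
def mean_reciprocal_rank(ranks):
    """Mean of the reciprocals of the ranks of the correct answers."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def hits_at_k(ranks, k):
    """Fraction of correct answers ranked within the top k positions."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

ranks = [4, 1, 2]  # rank vector from the example above
print(round(mean_reciprocal_rank(ranks), 5))  # 0.58333
print(round(hits_at_k(ranks, 3), 4))          # 0.6667 (i.e., 2/3)
```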
### Other metrics, multi-label
The output has multiple rows for each input sample (matching in `d3mIndex` value), with a `d3mIndex` column and a column for the predicted labels.
Only predicted labels should be listed in output rows.
| d3mIndex | label |
|----------|-------|
| 640 | 0 |
| 640 | 1 |
| 641 | 0 |
| 642 | 0 |
| 642 | 1 |
| 642 | 2 |
| 643 | 1 |
| 644 | 1 |
| 645 | 0 |
| 645 | 2 |
| 646 | 2 |
### Other metrics
For each input sample, one output row should be provided (matching in `d3mIndex` value), with a `d3mIndex` column and a target column for the prediction,
or multiple target columns in the `multivariate` case.
Target column names depend on problem target columns.
| d3mIndex | class |
|----------|-------|
| 640 | 0 |
| 641 | 0 |
| 642 | 2 |
| 643 | 1 |
| 644 | 1 |
| 645 | 0 |
| 646 | 2 |
__Notes__
- A target column is represented in the predictions file by its **colName**, which is specified in problemDoc.json under **inputs**->**data**->**targets**.
- We are interested in predictions for 'TEST' data points only.
- Typically, a predictions file will contain only one target column. If a given problem has multiple independent targets, it will be presented as separate problems, each with a single target. A simple test for whether multiple targets can be listed in one problem: is there only one confidence computed for those targets?
- If a dataset has multiple targets that are related, then the predictions file will have additional target columns. For example, see [object detection](#object-detection-task).
- Predictions can optionally have one or more of the special columns listed below, which are required to compute certain metrics (e.g., `rocAuc` requires `confidence` and `meanReciprocalRank` requires `rank`). Besides `d3mIndex`, the following are reserved column names in predictions:
  - `confidence`
  - `rank`
## Scores file
A scores file can be another output from a solution pipeline when the ground truth for test data is known; hence, this output is optional. The problem schema already specifies the given metrics under `performanceMetrics` in the `inputs` section.
The standard scores.csv file can contain the following columns: `[metric, value, normalized, randomSeed, fold]`, with `normalized` (the score normalized into the range [0, 1], where higher is better),
`randomSeed` (the random seed value used to run the solution pipeline), and `fold` (a 0-based index of the fold) being optional. `randomSeed` is the main seed used by a solution pipeline.
`metric` is the name of the metric as defined in `performanceMetrics` or equivalent (like the d3m core package enumeration).
The baseline systems produced by MIT-LL will produce both a predictions file and a scores file.
The old scores file format with columns `[index, metric, value]` has been deprecated, but you might still see it around.
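Emitting a standard scores file is straightforward with the csv module; the metric values below are illustrative only:

```python
import csv
import io

# Sketch: write a scores file with the optional columns included.
# The 0.91 / 0.88 values are made-up placeholders, not real results.
buf = io.StringIO()
writer = csv.DictWriter(
    buf, fieldnames=["metric", "value", "normalized", "randomSeed", "fold"])
writer.writeheader()
writer.writerow({"metric": "accuracy", "value": 0.91,
                 "normalized": 0.91, "randomSeed": 0, "fold": 0})
writer.writerow({"metric": "f1", "value": 0.88,
                 "normalized": 0.88, "randomSeed": 0, "fold": 0})
scores_csv = buf.getvalue()
print(scores_csv.splitlines()[0])  # metric,value,normalized,randomSeed,fold
```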
# Data Augmentation
The "dataAugmentation" field provides information about external sources of data that can be used to address the challenge of data augmentation. It is a list of dictionaries, each item corresponding to one external source. Each dictionary contains two fields: domain and keywords. "domain" captures the application domain(s) of an external dataset (e.g., government, census, economics) and "keywords" captures additional tags that help narrow the search (e.g., housing, household income).
# Appendix
<a name="link-prediction"></a>
## Link Prediction
Given an input graph/multigraph, link prediction involves predicting future possible links in the network or predicting missing links due to incomplete data in the network.
<a name="vertex-nomination"></a>
## Vertex Nomination
Given an input graph, vertex nomination involves learning a model that assigns a prioritized nomination list to each observed node in the graph.
<a name="vertex-classification"></a>
## Vertex Classification
Given an input (possibly node-attributed) graph, vertex classification involves assigning labels/classes to unknown nodes in the network based on the classes (and attributes) of known nodes and the network structure. Similar to the traditional classification task type, vertex classification can be of subtype binary, multiclass, or multilabel.
<a name="community-detection"></a>
## Community Detection
Given an input graph/network, possibly with a community structure, community detection involves grouping nodes into sets of (overlapping or non-overlapping) groups/clusters that reflect the community structure.
<a name="graph-matching"></a>
## Graph Matching
Given two input graphs, the task here is to find an approximate matching between the two graphs based on their structural properties. We assume that a partial correspondence between the graphs is given. Given two other nodes not specified in the partial correspondence, the model has to predict whether they match/correspond or not.
<a name="semi-supervised"></a>
## Semi-supervised Classification
This task requires learning classification models from both labeled and unlabeled data. The structure of this task is the same as the standard classification task, except that the target column will only have a small number of the labels provided. The rest of the values in the target column will be empty. Similar to supervised classification, semi-supervised classification can be of subtype binary, multiclass, or multilabel.
## Semi-supervised Regression
This task requires learning regression models from both labeled and unlabeled data. The structure of this task is the same as the standard regression task, except that the target column will only have a small number of the labels provided. The rest of the values in the target column will be empty. Similar to supervised regression, semi-supervised regression can be of subtype univariate or multivariate.
<a name="LUPI"></a>
## Learning Using Privileged Information
This is a special case of classification and regression problems where the training set has certain extra features, called privileged features, that the test set does not have. The models have to deal with the feature mismatch across the train/test split and take advantage of the extra features available at train time.
<a name="grouped-time-series"></a>
## Grouped time series forecasting
Adding the "grouped" qualifier to a time-series forecasting task implies that the data is grouped hierarchically. For example, the total monthly sales data of a bike manufacturer might be organized per "bike-model" per "region". In other words, we get individual time series of monthly sales if the whole table is grouped by both ("region", "bike-model"). The metadata of the dataset in datasetDoc.json will have the "suggestedGroupingKey" role set on those columns that would serve as a grouping key.
<a name="object-detection-task"></a>
## Object Detection Task
The typical structure of an objectDetection task is as follows:
```
LL1_some_object_detection_task/
|-- LL1_some_object_detection_task_dataset/
|   |-- tables/
|   |   |-- learningData.csv
|   |-- media/
|   |   |-- img01.jpg
|   |   |-- img02.jpg
|   |   |-- img03.jpg
|   |   |-- img04.jpg
|   |   ...
|   |-- datasetDoc.json
|-- LL1_some_object_detection_task_problem/
|   |-- problemDoc.json
|   |-- dataSplits.csv
|-- LL1_some_object_detection_task_solution/
|   |-- src/
|   |-- predictions.csv
|   |-- scores.csv
```
### learningData.csv
An example learningData:
| d3mIndex | image | someOtherCol1 | someOtherCol2 | ... | class | bounding_box |
|----------|-------|---------------|---------------|-----|-------|--------------|
|0|img01.jpg|someVal|someVal|...| dog |"10,10,60,11,60,100,10,100"|
|0|img01.jpg|someVal|someVal|...| dog |"10,20,60,31,60,110,10,110"|
|0|img01.jpg|someVal|someVal|...| tree |"20,10,70,11,70,100,20,100"|
|...|...|...|...|...|...|...|
|2|img03.jpg|someVal|someVal|...| cat |"50,6,100,13,100,131,50,54"|
|2|img03.jpg|someVal|someVal|...| car |"70,6,120,13,120,131,70,54"|
|...|...|...|...|...|...|...|
__Notes__
- Each row corresponds to a bounding box in an image.
- There is a one-to-one correspondence between d3mIndex and an image. Therefore, d3mIndex is a multiIndex, as it can have non-unique values, as shown above.
- For this type of dataset, the target column is a bounding polygon (named "bounding_box" in the above example).
- The type of the bounding_box column is "realVector" and captures 4 vertices using 8 coordinate values.
- The role of the bounding_box column is "boundingPolygon".
- An object detection dataset and problem can have a "class" target column. This column identifies which object is present inside the bounding polygon, when multiple classes of objects can be detected.
- The role of the class column is suggestedTarget, indicating that the model should output a class associated with each bounding box. In the case of a single class, all the output values will be the same.
- Typically, the metric used for this problem type requires that a confidence value be output along with each prediction.
### datasetDoc.json
An example datasetDoc:
```
...
"dataResources": [
  {
    "resID": "0",
    "resPath": "media/",
    "resType": "image",
    "resFormat": [
      "image/png"
    ],
    "isCollection": true
  },
  {
    "resID": "learningData",
    "resPath": "tables/learningData.csv",
    "resType": "table",
    "resFormat": [
      "text/csv"
    ],
    "isCollection": false,
    "columns": [
      {
        "colIndex": 0,
        "colName": "d3mIndex",
        "colType": "integer",
        "role": [
          "multiIndex"
        ]
      },
      {
        "colIndex": 1,
        "colName": "image",
        "colType": "string",
        "role": [
          "index"
        ],
        "refersTo": {
          "resID": "0",
          "resObject": "item"
        }
      },
      {
        "colIndex": 2,
        "colName": "class",
        "colType": "string",
        "role": [
          "suggestedTarget"
        ]
      },
      {
        "colIndex": 3,
        "colName": "bounding_box",
        "colType": "realVector",
        "role": [
          "suggestedTarget",
          "boundingPolygon"
        ],
        "refersTo": {
          "resID": "learningData",
          "resObject": {
            "columnName": "image"
          }
        }
      }
    ]
  }
...
```
# FAQ
#### Why did we include repeat and fold columns in dataSplits?
A number of datasets come with hold-out CV and k-fold CV splits provided. To leverage these splits and not create our own, we have included fold and repeat columns.
#### How is a problem with multiple target variables handled?
If the targets are independent, then we will create multiple problems, one for each target. On the other hand, if the targets are joint, the "targets" field in problemSchema will contain a list of dictionaries, one for each target description in the problem. An example of joint multi-target is the object detection problem, where the output is both a class (e.g., dog) and a bounding polygon.
```
"targets": [
  {
    "targetIndex": 0,
    "resID": "1",
    "colIndex": 2,
    "colName": "class"
  },
  {
    "targetIndex": 1,
    "resID": "1",
    "colIndex": 3,
    "colName": "bounding_box"
  }
]
```
#### How to submit predictions when there are multiple targets?
The target columns are carried over in the predictions file as shown:
|d3mIndex|class|bounding_box |confidence|
|--------|-----|---------------|----------|
|0 |dog |"..." |... |
|0 |cat |"..." |... |
|0 |cat |"..." |... |
|0 |cat |"..." |... |
