Dataset viewer documentation
Explore statistics over split data
The dataset viewer provides a /statistics endpoint for fetching basic statistics precomputed for a requested dataset. These statistics give you quick insight into how the data is distributed.
The /statistics endpoint requires three query parameters:
- dataset: the dataset name, for example nyu-mll/glue
- config: the subset name, for example cola
- split: the split name, for example train
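As a minimal sketch, the three query parameters can be assembled into a request URL with the standard library. The helper name build_statistics_url is hypothetical and used only for illustration; the base URL is the one used in the request example on this page.

```python
from urllib.parse import urlencode

# Base URL of the /statistics endpoint used on this page.
STATISTICS_URL = "https://datasets-server.huggingface.co/statistics"


def build_statistics_url(dataset: str, config: str, split: str) -> str:
    """Return the /statistics URL for one dataset, subset, and split.

    Hypothetical helper, not part of any official client library.
    """
    params = {"dataset": dataset, "config": config, "split": split}
    # safe="/" keeps the slash in "nyu-mll/glue" readable instead of "%2F".
    return f"{STATISTICS_URL}?{urlencode(params, safe='/')}"


url = build_statistics_url("nyu-mll/glue", "cola", "train")
print(url)
```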
Let’s get some stats for the train split of the cola subset of the nyu-mll/glue dataset:
import requests

# API_TOKEN must hold your Hugging Face access token; define it before running.
headers = {"Authorization": f"Bearer {API_TOKEN}"}
API_URL = "https://datasets-server.huggingface.co/statistics?dataset=nyu-mll/glue&config=cola&split=train"

def query():
    # Send the GET request and decode the JSON body of the response.
    response = requests.get(API_URL, headers=headers)
    return response.json()

data = query()

The response JSON contains three keys:
- num_examples - number of samples in a split or number of samples in the first chunk of data if the dataset is larger than 5 GB (see the partial field below).
- statistics - list of dictionaries of statistics per each column. Each dictionary has three keys: column_name, column_type, and column_statistics. The content of column_statistics depends on the column type; see Response structure by data type for more details.
- partial - true if statistics are computed on the first 5 GB of data, not on the full split, false otherwise.
{
"num_examples": 8551,
"statistics": [
{
"column_name": "idx",
"column_type": "int",
"column_statistics": {
"nan_count": 0,
"nan_proportion": 0,
"min": 0,
"max": 8550,
"mean": 4275,
"median": 4275,
"std": 2468.60541,
"histogram": {
"hist": [
856,
856,
856,
856,
856,
856,
856,
856,
856,
847
],
"bin_edges": [
0,
856,
1712,
2568,
3424,
4280,
5136,
5992,
6848,
7704,
8550
]
}
}
},
{
"column_name": "label",
"column_type": "class_label",
"column_statistics": {
"nan_count": 0,
"nan_proportion": 0,
"no_label_count": 0,
"no_label_proportion": 0,
"n_unique": 2,
"frequencies": {
"unacceptable": 2528,
"acceptable": 6023
}
}
},
{
"column_name": "sentence",
"column_type": "string_text",
"column_statistics": {
"nan_count": 0,
"nan_proportion": 0,
"min": 6,
"max": 231,
"mean": 40.70074,
"median": 37,
"std": 19.14431,
"histogram": {
"hist": [
2260,
4512,
1262,
380,
102,
26,
6,
1,
1,
1
],
"bin_edges": [
6,
29,
52,
75,
98,
121,
144,
167,
190,
213,
231
]
}
}
}
],
"partial": false
}
Response structure by data type
Currently, statistics are supported for strings, float and integer numbers, booleans, lists, datetimes, audio and image data, and the special datasets.ClassLabel feature type of the datasets library.
column_type in the response can be one of the following values:
- class_label - for the datasets.ClassLabel feature, which represents categorical data
- float - for float data types
- int - for integer data types
- bool - for boolean data type
- string_label - for string data types being treated as categories (see below)
- string_text - for string data types if they do not represent categories (see below)
- list - for lists of any other data types (including lists)
- audio - for audio data
- image - for image data
- datetime - for datetime data
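Because the shape of column_statistics depends on column_type, it can help to group the columns by type before reading their statistics. The sketch below does that and also carries over the top-level num_examples and partial fields. The function name summarize_statistics and the trimmed sample response are illustrative assumptions, not part of the API.

```python
from collections import defaultdict


def summarize_statistics(response: dict) -> dict:
    """Group column names by column_type and keep the top-level fields."""
    columns_by_type = defaultdict(list)
    for column in response["statistics"]:
        columns_by_type[column["column_type"]].append(column["column_name"])
    return {
        "num_examples": response["num_examples"],
        "partial": response["partial"],
        "columns_by_type": dict(columns_by_type),
    }


# Trimmed-down version of the nyu-mll/glue response shown above
# (column_statistics left empty for brevity).
sample_response = {
    "num_examples": 8551,
    "statistics": [
        {"column_name": "idx", "column_type": "int", "column_statistics": {}},
        {"column_name": "label", "column_type": "class_label", "column_statistics": {}},
        {"column_name": "sentence", "column_type": "string_text", "column_statistics": {}},
    ],
    "partial": False,
}

summary = summarize_statistics(sample_response)
if summary["partial"]:
    print("Statistics cover only the first 5 GB of the split.")
print(summary["columns_by_type"])
```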
class_label
This type represents categorical data encoded as a ClassLabel feature. The following measures are computed:
- number and proportion of null values
- number and proportion of values with no label
- number of unique values (excluding null and no label)
- value counts for each label (excluding null and no label)
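The value counts can be turned into per-label proportions on the client side. This is a small sketch with a hypothetical helper, label_proportions; the proportions are taken over labelled values only, since frequencies excludes null and no label.

```python
def label_proportions(column_statistics: dict) -> dict:
    """Convert the "frequencies" counts into proportions of labelled values."""
    frequencies = column_statistics["frequencies"]
    total = sum(frequencies.values())
    if total == 0:
        # No labelled values at all: nothing to normalise.
        return {}
    return {label: count / total for label, count in frequencies.items()}


# Frequencies of the "label" column of nyu-mll/glue (cola, train).
label_statistics = {
    "n_unique": 2,
    "frequencies": {"unacceptable": 2528, "acceptable": 6023},
}

proportions = label_proportions(label_statistics)
for label, share in proportions.items():
    print(f"{label}: {share:.2%}")
```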
Example
{
"column_name": "label",
"column_type": "class_label",
"column_statistics": {
"nan_count": 0,
"nan_proportion": 0,
"no_label_count": 0,
"no_label_proportion": 0,
"n_unique": 2,
"frequencies": {
"unacceptable": 2528,
"acceptable": 6023
}
}
}
float
The following measures are returned for float data types:
- minimum, maximum, mean, median, and standard deviation values
- number and proportion of null and NaN values (NaN values are treated as null)
- histogram with 10 bins
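To make these measures concrete, here is a minimal standard-library sketch that computes them (without the histogram) for a list of floats, counting None and NaN together as null. It is not the viewer's implementation. Using the sample standard deviation (statistics.stdev) is an assumption; it does reproduce the std of 2468.60541 reported for the idx column (the integers 0 to 8550) earlier on this page.

```python
import math
import statistics


def float_column_statistics(values: list) -> dict:
    """Summary measures for a float column; assumes >= 2 non-null values."""
    # None and NaN are both counted as null.
    valid = [v for v in values if v is not None and not math.isnan(v)]
    nan_count = len(values) - len(valid)
    return {
        "nan_count": nan_count,
        "nan_proportion": nan_count / len(values) if values else 0.0,
        "min": min(valid),
        "max": max(valid),
        "mean": statistics.mean(valid),
        "median": statistics.median(valid),
        # Sample standard deviation (assumption, see the text above).
        "std": statistics.stdev(valid),
    }


stats = float_column_statistics([0.0, 1.0, None, float("nan"), 2.0])
print(stats)
```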
Example
{
"column_name": "clarity",
"column_type": "float",
"column_statistics": {
"nan_count": 0,
"nan_proportion": 0,
"min": 0,
"max": 2,
"mean": 1.67206,
"median": 1.8,
"std": 0.38714,
"histogram": {
"hist": [
17,
12,
48,
52,
135,
188,
814,
15,
1628,
2048
],
"bin_edges": [
0,
0.2,
0.4,
0.6,
0.8,
1,
1.2,
1.4,
1.6,
1.8,
2
]
}
}
}
int
The following measures are returned for integer data types:
- minimum, maximum, mean, median, and standard deviation values
- number and proportion of null values
- histogram with at most 10 bins
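The integer histograms on this page (for example bin_edges of 0, 856, ..., 7704, 8550 for idx, or 0, 1, 1 for a column that only takes the values 0 and 1) are consistent with integer-width bins whose last bin includes the maximum. The sketch below reproduces that pattern. The rule is inferred from the examples here, not taken from the viewer's source, so treat it as an assumption; int_histogram is a hypothetical name.

```python
import math


def int_histogram(values: list, max_bins: int = 10) -> dict:
    """Histogram with integer-width bins; the last bin includes the maximum."""
    lo, hi = min(values), max(values)
    n_possible = hi - lo + 1                 # number of integers in [lo, hi]
    n_bins = min(max_bins, n_possible)       # never more bins than integers
    bin_size = math.ceil(n_possible / n_bins)
    starts = list(range(lo, hi + 1, bin_size))  # left edge of every bin
    hist = [0] * len(starts)
    for v in values:
        hist[(v - lo) // bin_size] += 1
    # The final edge is the maximum itself, closing the last bin.
    return {"hist": hist, "bin_edges": starts + [hi]}


idx_histogram = int_histogram(list(range(8551)))
binary_histogram = int_histogram([0] * 50075 + [1] * 49925)
print(idx_histogram["bin_edges"])
print(binary_histogram)
```

With these inputs the sketch yields the same hist and bin_edges as the idx and direction examples on this page.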
Example
{
"column_name": "direction",
"column_type": "int",
"column_statistics": {
"nan_count": 0,
"nan_proportion": 0.0,
"min": 0,
"max": 1,
"mean": 0.49925,
"median": 0.0,
"std": 0.5,
"histogram": {
"hist": [
50075,
49925
],
"bin_edges": [
0,
1,
1
]
}
}
}
bool
The following measures are returned for the bool data type:
- number and proportion of null values
- value counts for 'True' and 'False' values
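As an illustration, the same measures can be derived from raw values in a few lines. This is a sketch rather than the viewer's code, and bool_column_statistics is a hypothetical name; the keys of frequencies are the strings "True" and "False", as in the response format.

```python
from collections import Counter


def bool_column_statistics(values: list) -> dict:
    """Null count/proportion and True/False counts for a bool column."""
    non_null = [v for v in values if v is not None]
    nan_count = len(values) - len(non_null)
    counts = Counter(str(v) for v in non_null)  # keys "True" / "False"
    return {
        "nan_count": nan_count,
        "nan_proportion": nan_count / len(values) if values else 0.0,
        "frequencies": {"False": counts["False"], "True": counts["True"]},
    }


# 10 True, 7 False, and 3 null values.
bool_values = [True] * 10 + [False] * 7 + [None] * 3
bool_stats = bool_column_statistics(bool_values)
print(bool_stats)
```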
Example
{
"column_name": "penalty",
"column_type": "bool",
"column_statistics":
{
"nan_count": 3,
"nan_proportion": 0.15,
"frequencies": {
"False": 7,
"True": 10
}
}
}
string_label
If the proportion of unique values in a string column within the requested split is lower than or equal to 0.2 and the number of unique values is lower than 1000, or if the number of unique values is lower than or equal to 10 (independently of the proportion), the column is considered to be a category. The following measures are returned:
- number and proportion of null values
- number of unique values (excluding null)
- value counts for each label (excluding null)
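The rule that decides between string_label and string_text can be written as a small predicate. The sketch below follows the thresholds stated above. The function name is hypothetical, and which rows count toward the proportion (all rows or only non-null ones) is not specified here, so n_values is left to the caller.

```python
def is_string_label(n_unique: int, n_values: int) -> bool:
    """Return True if a string column would be treated as a category."""
    # At most 10 unique values: a category regardless of the proportion.
    if n_unique <= 10:
        return True
    # Otherwise: proportion of unique values <= 0.2 and fewer than 1000 of them.
    return n_values > 0 and n_unique / n_values <= 0.2 and n_unique < 1000


print(is_string_label(n_unique=4, n_values=4957))      # few unique values
print(is_string_label(n_unique=5000, n_values=10000))  # mostly unique texts
```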
Example
{
"column_name": "answerKey",
"column_type": "string_label",
"column_statistics": {
"nan_count": 0,
"nan_proportion": 0,
"n_unique": 4,
"frequencies": {
"D": 1221,
"C": 1146,
"A": 1378,
"B": 1212
}
}
}
string_text
If a string column does not satisfy the conditions to be treated as a string_label, it is considered to be a column containing texts, and the response contains statistics over text lengths, measured as the number of characters. The following measures are computed:
- minimum, maximum, mean, median, and standard deviation of text lengths
- number and proportion of null values
- histogram of text lengths with 10 bins
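Text length here means the number of characters, not bytes. In Python, that corresponds to len() on a str, which can differ from the UTF-8 byte length for non-ASCII text. A brief sketch with made-up sample texts:

```python
import statistics


def text_lengths(texts: list) -> list:
    """Character lengths of the non-null texts."""
    return [len(text) for text in texts if text is not None]


# Made-up sample values; None stands for a null entry.
texts = ["naïve", None, "The cat sat."]
lengths = text_lengths(texts)

length_stats = {
    "nan_count": len(texts) - len(lengths),
    "min": min(lengths),
    "max": max(lengths),
    "mean": statistics.mean(lengths),
    "median": statistics.median(lengths),
}
print(length_stats)
# "naïve" has 5 characters but 6 bytes in UTF-8.
print(len("naïve"), len("naïve".encode("utf-8")))
```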
Example
{
"column_name": "sentence",
"column_type": "string_text",
"column_statistics": {
"nan_count": 0,
"nan_proportion": 0,
"min": 6,
"max": 231,
"mean": 40.70074,
"median": 37,
"std": 19.14431,
"histogram": {
"hist": [
2260,
4512,
1262,
380,
102,
26,
6,
1,
1,
1
],
"bin_edges": [
6,
29,
52,
75,
98,
121,
144,
167,
190,
213,
231
]
}
}
}
list
For lists, the distribution of their lengths is computed. The following measures are returned:
- minimum, maximum, mean, median, and standard deviation of list lengths
- number and proportion of null values
- histogram of list lengths with up to 10 bins
Example
{
"column_name": "chat_history",
"column_type": "list",
"column_statistics": {
"nan_count": 0,
"nan_proportion": 0.0,
"min": 1,
"max": 3,
"mean": 1.01741,
"median": 1.0,
"std": 0.13146,
"histogram": {
"hist": [
11177,
196,
1
],
"bin_edges": [
1,
2,
3,
3
]
}
}
}
Note that dictionaries of lists are not supported.
audio
For audio data, the distribution of audio file durations is computed. The following measures are returned:
- minimum, maximum, mean, median, and standard deviation of audio file durations
- number and proportion of null values
- histogram of audio file durations with 10 bins
Example
{
"column_name": "audio",
"column_type": "audio",
"column_statistics": {
"nan_count": 0,
"nan_proportion": 0,
"min": 1.02,
"max": 15,
"mean": 13.93042,
"median": 14.77,
"std": 2.63734,
"histogram": {
"hist": [
32,
25,
18,
24,
22,
17,
18,
19,
55,
1770
],
"bin_edges": [
1.02,
2.418,
3.816,
5.214,
6.612,
8.01,
9.408,
10.806,
12.204,
13.602,
15
]
}
}
}
image
For image data, the distribution of image widths is computed. The following measures are returned:
- minimum, maximum, mean, median, and standard deviation of widths of image files
- number and proportion of null values
- histogram of image widths with 10 bins
Example
{
"column_name": "image",
"column_type": "image",
"column_statistics": {
"nan_count": 0,
"nan_proportion": 0.0,
"min": 256,
"max": 873,
"mean": 327.99339,
"median": 341.0,
"std": 60.07286,
"histogram": {
"hist": [
1734,
1637,
1326,
121,
10,
3,
1,
3,
1,
2
],
"bin_edges": [
256,
318,
380,
442,
504,
566,
628,
690,
752,
814,
873
]
}
}
}
datetime
For datetime data, the distribution of datetime values is computed. The following measures are returned:
- minimum, maximum, mean, median, and standard deviation of datetimes represented as strings with precision up to seconds
- number and proportion of null values
- histogram of datetimes with 10 bins
Example
{
"column_name": "date",
"column_type": "datetime",
"column_statistics": {
"nan_count": 0,
"nan_proportion": 0.0,
"min": "2013-05-18 04:54:11",
"max": "2013-06-20 10:01:41",
"mean": "2013-05-27 18:03:39",
"median": "2013-05-23 11:55:50",
"std": "11 days, 4:57:32.322450",
"histogram": {
"hist": [
318776,
393036,
173904,
0,
0,
0,
0,
0,
0,
206284
],
"bin_edges": [
"2013-05-18 04:54:11",
"2013-05-21 12:36:57",
"2013-05-24 20:19:43",
"2013-05-28 04:02:29",
"2013-05-31 11:45:15",
"2013-06-03 19:28:01",
"2013-06-07 03:10:47",
"2013-06-10 10:53:33",
"2013-06-13 18:36:19",
"2013-06-17 02:19:05",
"2013-06-20 10:01:41"
]
}
}
}
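Since datetime statistics are returned as strings, a client usually needs to parse them before doing arithmetic. The sketch below parses a subset of the example above. The format string "%Y-%m-%d %H:%M:%S" is inferred from that example and is an assumption; columns with time zones or other precision may be formatted differently.

```python
from datetime import datetime, timedelta

# Format inferred from the example response above (assumption).
DATETIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_datetime(value: str) -> datetime:
    """Parse one datetime string from the statistics response."""
    return datetime.strptime(value, DATETIME_FORMAT)


# Values copied from the "date" column example above.
date_statistics = {
    "min": "2013-05-18 04:54:11",
    "max": "2013-06-20 10:01:41",
    "histogram": {
        "bin_edges": ["2013-05-18 04:54:11", "2013-05-21 12:36:57"],
    },
}

span = parse_datetime(date_statistics["max"]) - parse_datetime(date_statistics["min"])
edges = [parse_datetime(edge) for edge in date_statistics["histogram"]["bin_edges"]]
first_bin_width = edges[1] - edges[0]

print("time span covered:", span)
print("width of first bin:", first_bin_width)
```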