The citation graph datasets, including cora, citeseer and pubmed.
Nodes represent publications and edges represent citation relationships.
Parameters:
name (str) – name can be ‘cora’, ‘citeseer’ or ‘pubmed’.
raw_dir (str) – Directory where the raw files are downloaded, or which already contains the input data directory.
Default: ~/.dgl/
force_reload (bool) – Whether to reload the dataset. Default: False.
verbose (bool) – Whether to print out progress information. Default: True.
reverse_edge (bool) – Whether to add reverse edges to the graph. Default: True.
transform (callable, optional) – A transform that takes in a DGLGraph object and returns
a transformed version. The DGLGraph object will be
transformed before every access.
reorder (bool) – Whether to reorder the graph using reorder_graph(). Default: False.
Loads input data from the data directory and reorders the graph for better locality.
ind.name.x => the feature vectors of the training instances as scipy.sparse.csr.csr_matrix object;
ind.name.tx => the feature vectors of the test instances as scipy.sparse.csr.csr_matrix object;
ind.name.allx => the feature vectors of both labeled and unlabeled training instances
(a superset of ind.name.x) as scipy.sparse.csr.csr_matrix object;
ind.name.y => the one-hot labels of the labeled training instances as numpy.ndarray object;
ind.name.ty => the one-hot labels of the test instances as numpy.ndarray object;
ind.name.ally => the labels for instances in ind.name.allx as numpy.ndarray object;
ind.name.graph => a dict in the format {index: [index_of_neighbor_nodes]} as collections.defaultdict
object;
ind.name.test.index => the indices of test instances in graph, for the inductive setting as list object.
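The adjacency stored in ind.name.graph is a dict that maps each node index to the indices of its neighbors. As a minimal sketch, assuming a made-up toy dict rather than the real file contents, such a structure can be flattened into parallel source and destination lists like this (the helper name adjacency_to_edges is hypothetical and not part of DGL):

```python
from collections import defaultdict


def adjacency_to_edges(adj):
    """Flatten {node: [neighbors]} into parallel src/dst lists."""
    src, dst = [], []
    for node in sorted(adj):
        for nbr in adj[node]:
            src.append(node)
            dst.append(nbr)
    return src, dst


# Toy graph: node 0 links to 1 and 2, node 1 links to 2, node 2 has no entries.
toy = defaultdict(list, {0: [1, 2], 1: [2], 2: []})
src, dst = adjacency_to_edges(toy)
```

Parallel lists of this kind are a common input form for graph constructors, which is why the sketch returns two lists instead of a list of pairs.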
Nodes represent scientific publications and edges
represent citation relationships. Each node has a
predefined feature vector with 3703 dimensions. The
dataset is designed for the node classification
task, which is to predict the category of
a given publication.
Statistics:
Nodes: 3327
Edges: 9228
Number of Classes: 6
Label Split:
Train: 120
Valid: 500
Test: 1000
Parameters:
raw_dir (str) – Directory where the raw files are downloaded, or which already contains the input data directory.
Default: ~/.dgl/
force_reload (bool) – Whether to reload the dataset. Default: False.
verbose (bool) – Whether to print out progress information. Default: True.
reverse_edge (bool) – Whether to add reverse edges to the graph. Default: True.
transform (callable, optional) – A transform that takes in a DGLGraph object and returns
a transformed version. The DGLGraph object will be
transformed before every access.
reorder (bool) – Whether to reorder the graph using reorder_graph(). Default: False.
In the citeseer dataset, some nodes in the graph are isolated.
These isolated nodes are added as zero vectors at the corresponding positions.
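One way such padding can work is sketched below in plain Python. It assumes the gaps fall inside a contiguous index range and that only some indices in that range come with a feature row. The helper name fill_missing_rows and the toy values are made up for illustration and are not DGL's actual implementation.

```python
def fill_missing_rows(indices, rows, num_features):
    """Place each row at its index offset and leave zero vectors in the gaps."""
    lo, hi = min(indices), max(indices)
    # Start with an all-zero row for every index in the contiguous range.
    full = [[0.0] * num_features for _ in range(hi - lo + 1)]
    for idx, row in zip(indices, rows):
        full[idx - lo] = list(row)
    return full


# Index 6 has no feature row, so it ends up as a zero vector.
padded = fill_missing_rows([5, 7, 8], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], 2)
```

Keeping the zero rows in place means the row position still equals the node index minus the range start, so labels and features stay aligned.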
Examples
>>> from dgl.data import CiteseerGraphDataset
>>> dataset = CiteseerGraphDataset()
>>> g = dataset[0]
>>> num_class = dataset.num_classes
>>>
>>> # get node feature
>>> feat = g.ndata['feat']
>>>
>>> # get data split
>>> train_mask = g.ndata['train_mask']
>>> val_mask = g.ndata['val_mask']
>>> test_mask = g.ndata['test_mask']
>>>
>>> # get labels
>>> label = g.ndata['label']
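As a rough illustration of how the three masks relate to the label split, the sketch below uses plain Python lists as stand-ins for the boolean masks stored in g.ndata. The toy masks and the helper name mask_to_indices are made up for this example. Note that the split sizes listed in the statistics (120 + 500 + 1000) add up to fewer than the 3327 nodes, so many nodes belong to none of the three masks.

```python
def mask_to_indices(mask):
    """Return the positions where a boolean mask is set."""
    return [i for i, flag in enumerate(mask) if flag]


# Toy masks for a six-node graph (not real Citeseer data).
train_mask = [True, True, False, False, False, False]
val_mask = [False, False, True, False, False, False]
test_mask = [False, False, False, True, True, False]

train_idx = mask_to_indices(train_mask)
val_idx = mask_to_indices(val_mask)
test_idx = mask_to_indices(test_mask)

# Nodes that appear in no split (node 5 in this toy example).
unused = [
    i for i in range(len(train_mask))
    if not (train_mask[i] or val_mask[i] or test_mask[i])
]
```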
A mini-dataset for a binary classification task using Cora.
After loading, it has the following members:
graphs : list of DGLGraph
pmpds : list of scipy.sparse.coo_matrix
labels : list of numpy.ndarray
Parameters:
raw_dir (str) – Directory where the raw files are downloaded, or which already contains the input data directory.
Default: ~/.dgl/
force_reload (bool) – Whether to reload the dataset. Default: False
verbose (bool) – Whether to print out progress information. Default: True.
transform (callable, optional) – A transform that takes in a DGLGraph object and returns
a transformed version. The DGLGraph object will be
transformed before every access.
Nodes represent papers and edges represent citation
relationships. Each node has a predefined
feature vector with 1433 dimensions. The dataset is
designed for the node classification task,
which is to predict the category of
a given paper.
Statistics:
Nodes: 2708
Edges: 10556
Number of Classes: 7
Label split:
Train: 140
Valid: 500
Test: 1000
Parameters:
raw_dir (str) – Directory where the raw files are downloaded, or which already contains the input data directory.
Default: ~/.dgl/
force_reload (bool) – Whether to reload the dataset. Default: False.
verbose (bool) – Whether to print out progress information. Default: True.
reverse_edge (bool) – Whether to add reverse edges to the graph. Default: True.
transform (callable, optional) – A transform that takes in a DGLGraph object and returns
a transformed version. The DGLGraph object will be
transformed before every access.
reorder (bool) – Whether to reorder the graph using reorder_graph(). Default: False.
>>> from dgl.data import CoraGraphDataset
>>> dataset = CoraGraphDataset()
>>> g = dataset[0]
>>> num_class = dataset.num_classes
>>>
>>> # get node feature
>>> feat = g.ndata['feat']
>>>
>>> # get data split
>>> train_mask = g.ndata['train_mask']
>>> val_mask = g.ndata['val_mask']
>>> test_mask = g.ndata['test_mask']
>>>
>>> # get labels
>>> label = g.ndata['label']
Nodes represent scientific publications and edges
represent citation relationships. Each node has a
predefined feature vector with 500 dimensions. The
dataset is designed for the node classification
task, which is to predict the category of
a given publication.
Statistics:
Nodes: 19717
Edges: 88651
Number of Classes: 3
Label Split:
Train: 60
Valid: 500
Test: 1000
Parameters:
raw_dir (str) – Directory where the raw files are downloaded, or which already contains the input data directory.
Default: ~/.dgl/
force_reload (bool) – Whether to reload the dataset. Default: False.
verbose (bool) – Whether to print out progress information. Default: True.
reverse_edge (bool) – Whether to add reverse edges to the graph. Default: True.
transform (callable, optional) – A transform that takes in a DGLGraph object and returns
a transformed version. The DGLGraph object will be
transformed before every access.
reorder (bool) – Whether to reorder the graph using reorder_graph(). Default: False.
>>> from dgl.data import PubmedGraphDataset
>>> dataset = PubmedGraphDataset()
>>> g = dataset[0]
>>> num_class = dataset.num_classes
>>>
>>> # get node feature
>>> feat = g.ndata['feat']
>>>
>>> # get data split
>>> train_mask = g.ndata['train_mask']
>>> val_mask = g.ndata['val_mask']
>>> test_mask = g.ndata['test_mask']
>>>
>>> # get labels
>>> label = g.ndata['label']
raw_dir (str) – Directory where the raw files are downloaded, or which already contains the input data directory.
Default: ~/.dgl/
force_reload (bool) – Whether to reload the dataset. Default: False.
verbose (bool) – Whether to print out progress information. Default: True.
reverse_edge (bool) – Whether to add reverse edges to the graph. Default: True.
transform (callable, optional) – A transform that takes in a DGLGraph object and returns
a transformed version. The DGLGraph object will be
transformed before every access.
‘Computer’ part of the AmazonCoBuy dataset for node classification task.
Amazon Computers and Amazon Photo are segments of the Amazon co-purchase graph [McAuley et al., 2015],
where nodes represent goods, edges indicate that two goods are frequently bought together, node
features are bag-of-words encoded product reviews, and class labels are given by the product category.
Edges: 491,722 (note that the original dataset has 245,778 edges, but DGL adds
the reverse edges and removes the duplicates, so the number differs)
Number of classes: 10
Node feature size: 767
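The effect of adding reverse edges and removing duplicates on the edge count can be sketched on a toy edge list. This is a plain-Python illustration, not DGL's implementation, and the helper name add_reverse_and_dedup is made up. The resulting count depends on how many reverse and duplicate edges were already present, so it need not be exactly double the original.

```python
def add_reverse_and_dedup(edges):
    """Return the sorted set of edges after adding each edge's reverse."""
    seen = set()
    for u, v in edges:
        seen.add((u, v))
        seen.add((v, u))
    return sorted(seen)


# Four input edges: (0, 1) already has its reverse, and (1, 2) is duplicated.
edges = [(0, 1), (1, 0), (1, 2), (1, 2)]
symmetric = add_reverse_and_dedup(edges)
```

In this toy case the input and output both contain four edges, although the set of edges has changed.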
Parameters:
raw_dir (str) – Directory where the raw files are downloaded, or which already contains the input data directory.
Default: ~/.dgl/
force_reload (bool) – Whether to reload the dataset. Default: False
verbose (bool) – Whether to print out progress information. Default: True.
transform (callable, optional) – A transform that takes in a DGLGraph object and returns
a transformed version. The DGLGraph object will be
transformed before every access.
Zachary’s karate club is a social network of a university
karate club, described in the paper “An Information Flow
Model for Conflict and Fission in Small Groups” by Wayne W. Zachary.
The network became a popular example of community structure in
networks after its use by Michelle Girvan and Mark Newman in 2002.
Official website: http://konect.cc/networks/ucidata-zachary/
Karate Club dataset statistics:
Nodes: 34
Edges: 156
Number of Classes: 2
Parameters:
transform (callable, optional) – A transform that takes in a DGLGraph object and returns
a transformed version. The DGLGraph object will be
transformed before every access.
Protein-Protein Interaction dataset for inductive node classification.
A toy Protein-Protein Interaction network dataset. The dataset contains
24 graphs. The average number of nodes per graph is 2372. Each node has
50 features and 121 labels. Of the 24 graphs, 20 are used for training,
2 for validation and 2 for testing.
mode (str) – Must be one of (‘train’, ‘valid’, ‘test’).
Default: ‘train’
raw_dir (str) – Directory where the raw files are downloaded, or which already contains the input data directory.
Default: ~/.dgl/
force_reload (bool) – Whether to reload the dataset.
Default: False
verbose (bool) – Whether to print out progress information.
Default: True.
transform (callable, optional) – A transform that takes in a DGLGraph object and returns
a transformed version. The DGLGraph object will be
transformed before every access.
Override this method to implement your own logic for
saving the processed dataset to files.
It is recommended to use dgl.data.utils.save_graphs
to save DGL graphs to files and
dgl.data.utils.save_info to save extra
information to files.
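The general pattern behind such a save override (process once, write to disk, reload on later runs) can be sketched with the standard library alone. The class below is a hypothetical stand-in, not a DGL class: it uses pickle where a real DGL dataset would call the dgl.data.utils helpers recommended above, and the file name and attribute names are assumptions made for this example.

```python
import os
import pickle
import tempfile


class CachedToyDataset:
    """Stand-in for a dataset with a save/load cache (not a DGL class)."""

    def __init__(self, save_dir):
        self.path = os.path.join(save_dir, "toy_dataset.pkl")
        if os.path.exists(self.path):
            # A cached copy exists, so skip processing.
            self.load()
            self.from_cache = True
        else:
            self.process()
            self.save()
            self.from_cache = False

    def process(self):
        # Placeholder for expensive preprocessing of raw files.
        self.edges = [(0, 1), (1, 2)]
        self.info = {"num_classes": 2}

    def save(self):
        # Write the processed data and the extra information to disk.
        with open(self.path, "wb") as f:
            pickle.dump({"edges": self.edges, "info": self.info}, f)

    def load(self):
        with open(self.path, "rb") as f:
            data = pickle.load(f)
        self.edges, self.info = data["edges"], data["info"]


with tempfile.TemporaryDirectory() as tmp:
    first = CachedToyDataset(tmp)   # processes and saves
    second = CachedToyDataset(tmp)  # loads from the cache
    first_from_cache = first.from_cache
    second_from_cache = second.from_cache
    same_edges = first.edges == second.edges
```

Keeping the graph data and the extra information as separate entries mirrors the recommendation to save graphs and extra information separately.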