easygraph.datasets.citation_graph module
Cora, CiteSeer, and PubMed datasets.
- class easygraph.datasets.citation_graph.CitationGraphDataset(name, raw_dir=None, force_reload=False, verbose=True, reverse_edge=True, transform=None, reorder=False)
Bases:
EasyGraphBuiltinDataset
The citation graph dataset, including Cora, CiteSeer and PubMed. Nodes represent publications and edges represent citation relationships.
- Parameters:
name (str) – name can be ‘Cora’, ‘CiteSeer’ or ‘PubMed’.
raw_dir (str) – Directory where the raw input data is downloaded to, or that already contains it. Default: ~/.dgl/
force_reload (bool) – Whether to reload the dataset. Default: False.
verbose (bool) – Whether to print out progress information. Default: True.
reverse_edge (bool) – Whether to add reverse edges in the graph. Default: True.
transform (callable, optional) – A transform that takes in a Graph object and returns a transformed version. The Graph object will be transformed before every access.
reorder (bool) – Whether to reorder the graph using reorder_graph(). Default: False.
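To illustrate the transform parameter, here is a minimal sketch of a dataset that applies a user-supplied callable on every access. TinyDataset and add_self_loops are hypothetical names used only for illustration, and a plain dict stands in for the Graph object; this is not the library's implementation.

```python
class TinyDataset:
    """Hypothetical stand-in for a dataset that holds a single graph."""

    def __init__(self, graph, transform=None):
        self._graph = graph
        self._transform = transform

    def __getitem__(self, idx):
        if idx != 0:
            raise IndexError("this dataset holds a single graph")
        # The stored graph is left untouched; the transform runs on each access.
        if self._transform is None:
            return self._graph
        return self._transform(self._graph)

    def __len__(self):
        return 1


def add_self_loops(graph):
    """Hypothetical transform: return a copy with a self-loop on every node."""
    edges = set(graph["edges"]) | {(n, n) for n in graph["nodes"]}
    return {"nodes": list(graph["nodes"]), "edges": sorted(edges)}


raw = {"nodes": [0, 1, 2], "edges": [(0, 1), (1, 2)]}
dataset = TinyDataset(raw, transform=add_self_loops)
g = dataset[0]
print(g["edges"])
```

Because the transform returns a new object, the graph stored in the dataset stays unchanged between accesses.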
- has_cache()
Override this method to implement your own logic for deciding whether a cached dataset exists.
Returns False by default.
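The interplay between has_cache() and force_reload can be sketched as follows. This is a simplified illustration, not EasyGraph's actual loading code: CachedSquares, its cache file name, and the _load method are hypothetical.

```python
import json
import os
import tempfile


class CachedSquares:
    """Hypothetical dataset that caches its processed data as JSON."""

    def __init__(self, save_dir, force_reload=False):
        self.save_path = os.path.join(save_dir, "squares.json")
        self.loaded_from_cache = False
        self._load(force_reload)

    def has_cache(self):
        # Custom logic: a cache exists if the processed file is on disk.
        return os.path.exists(self.save_path)

    def process(self):
        # Stand-in for expensive parsing of raw files.
        self.data = [i * i for i in range(5)]

    def _load(self, force_reload):
        if not force_reload and self.has_cache():
            with open(self.save_path) as f:
                self.data = json.load(f)
            self.loaded_from_cache = True
        else:
            self.process()
            with open(self.save_path, "w") as f:
                json.dump(self.data, f)


with tempfile.TemporaryDirectory() as tmp:
    first = CachedSquares(tmp)                     # no cache yet: processes
    second = CachedSquares(tmp)                    # cache found: loads it
    third = CachedSquares(tmp, force_reload=True)  # cache ignored: reprocesses
    print(first.loaded_from_cache, second.loaded_from_cache, third.loaded_from_cache)
```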
- property num_classes
- property num_labels
- process()
Loads the input data from the data directory and reorders the graph for better locality.
ind.name.x => the feature vectors of the training instances as a scipy.sparse.csr.csr_matrix object;
ind.name.tx => the feature vectors of the test instances as a scipy.sparse.csr.csr_matrix object;
ind.name.allx => the feature vectors of both labeled and unlabeled training instances (a superset of ind.name.x) as a scipy.sparse.csr.csr_matrix object;
ind.name.y => the one-hot labels of the labeled training instances as a numpy.ndarray object;
ind.name.ty => the one-hot labels of the test instances as a numpy.ndarray object;
ind.name.ally => the labels for the instances in ind.name.allx as a numpy.ndarray object;
ind.name.graph => a dict in the format {index: [index_of_neighbor_nodes]} as a collections.defaultdict object;
ind.name.test.index => the indices of the test instances in the graph, for the inductive setting, as a list object.
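As a rough illustration of the ind.name.graph format and the reverse_edge option, the sketch below turns an adjacency dict of the form {index: [index_of_neighbor_nodes]} into an edge list. The helper build_edges and the toy adjacency are hypothetical; the actual process() may build edges differently.

```python
from collections import defaultdict


def build_edges(adjacency, reverse_edge=True):
    """Return a sorted, de-duplicated list of (src, dst) pairs."""
    edges = set()
    for src, neighbors in adjacency.items():
        for dst in neighbors:
            edges.add((src, dst))
            if reverse_edge:
                # Also store the opposite direction.
                edges.add((dst, src))
    return sorted(edges)


toy = defaultdict(list)
toy[0] = [1, 2]
toy[1] = [2]
toy[2] = []

directed = build_edges(toy, reverse_edge=False)
bidirected = build_edges(toy, reverse_edge=True)
print(directed)
print(bidirected)
```

With reverse_edge=True every citation appears in both directions, so the toy graph grows from three edges to six.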
- property reverse_edge
- property save_name
- class easygraph.datasets.citation_graph.CiteseerGraphDataset(raw_dir=None, force_reload=False, verbose=True, reverse_edge=True, transform=None, reorder=False)
Bases:
CitationGraphDataset
Citeseer citation network dataset.
Nodes represent scientific publications and edges represent citation relationships. Each node has a predefined feature vector with 3703 dimensions. The dataset is designed for the node classification task: predicting the category of a given publication.
Statistics:
Nodes: 3327
Edges: 9228
Number of Classes: 6
Label Split:
Train: 120
Valid: 500
Test: 1000
- Parameters:
raw_dir (str) – Directory where the raw input data is downloaded to, or that already contains it. Default: ~/.dgl/
force_reload (bool) – Whether to reload the dataset. Default: False.
verbose (bool) – Whether to print out progress information. Default: True.
reverse_edge (bool) – Whether to add reverse edges in the graph. Default: True.
transform (callable, optional) – A transform that takes in a DGLGraph object and returns a transformed version. The DGLGraph object will be transformed before every access.
reorder (bool) – Whether to reorder the graph using reorder_graph(). Default: False.
- num_classes
Number of label classes
- Type:
int
Notes
The node features are row-normalized.
The Citeseer dataset contains some isolated nodes. Their features are added as zero vectors at the correct positions.
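One way to picture the zero-vector padding: if feature rows exist only for some node indices, the missing indices can be filled with all-zero rows so that row i always belongs to node i. The function pad_missing_rows and the toy data below are hypothetical and only sketch the idea.

```python
def pad_missing_rows(indices, rows, num_nodes, dim):
    """Place each row at its node index; fill the gaps with zero vectors."""
    features = [[0.0] * dim for _ in range(num_nodes)]
    for idx, row in zip(indices, rows):
        features[idx] = list(row)
    return features


# Nodes 1 and 3 have no feature rows (the isolated nodes of this toy example).
known_indices = [0, 2, 4]
known_rows = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
features = pad_missing_rows(known_indices, known_rows, num_nodes=5, dim=2)
print(features)
```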
Examples
>>> from easygraph.datasets.citation_graph import CiteseerGraphDataset
>>> dataset = CiteseerGraphDataset()
>>> g = dataset[0]
>>> num_class = dataset.num_classes
>>>
>>> # get node feature
>>> feat = g.ndata['feat']
>>>
>>> # get data split
>>> train_mask = g.ndata['train_mask']
>>> val_mask = g.ndata['val_mask']
>>> test_mask = g.ndata['test_mask']
>>>
>>> # get labels
>>> label = g.ndata['label']
- class easygraph.datasets.citation_graph.CoraBinary(raw_dir=None, force_reload=False, verbose=True, transform=None)
Bases:
EasyGraphBuiltinDataset
A mini-dataset for the binary classification task using Cora.
After loading, it has the following members:
graphs : list of DGLGraph
pmpds : list of scipy.sparse.coo_matrix
labels : list of numpy.ndarray
- Parameters:
raw_dir (str) – Directory where the raw input data is downloaded to, or that already contains it. Default: ~/.dgl/
force_reload (bool) – Whether to reload the dataset. Default: False.
verbose (bool) – Whether to print out progress information. Default: True.
transform (callable, optional) – A transform that takes in a DGLGraph object and returns a transformed version. The DGLGraph object will be transformed before every access.
- has_cache()
Override this method to implement your own logic for deciding whether a cached dataset exists.
Returns False by default.
- property save_name
- class easygraph.datasets.citation_graph.CoraGraphDataset(raw_dir=None, force_reload=False, verbose=True, reverse_edge=True, transform=None, reorder=False)
Bases:
CitationGraphDataset
Cora citation network dataset.
Nodes represent papers and edges represent citation relationships. Each node has a predefined feature vector with 1433 dimensions. The dataset is designed for the node classification task: predicting the category of a given paper.
Statistics:
Nodes: 2708
Edges: 10556
Number of Classes: 7
Label split:
Train: 140
Valid: 500
Test: 1000
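The label split above can be pictured as boolean per-node masks. The sketch below builds such masks with the listed sizes (140/500/1000 out of 2708 nodes). The index ranges are an assumption made for illustration; the real split indices come from the dataset files, and make_mask is a hypothetical helper.

```python
def make_mask(indices, num_nodes):
    """Return a list of booleans that is True exactly at the given indices."""
    mask = [False] * num_nodes
    for i in indices:
        mask[i] = True
    return mask


num_nodes = 2708
# Assumed, illustrative index ranges (not read from the dataset files).
train_idx = range(0, 140)
val_idx = range(140, 640)
test_idx = range(num_nodes - 1000, num_nodes)

train_mask = make_mask(train_idx, num_nodes)
val_mask = make_mask(val_idx, num_nodes)
test_mask = make_mask(test_idx, num_nodes)
print(sum(train_mask), sum(val_mask), sum(test_mask))

# A mask selects the matching entries of any per-node list, e.g. labels.
labels = [i % 7 for i in range(num_nodes)]
train_labels = [y for y, keep in zip(labels, train_mask) if keep]
print(len(train_labels))
```

Nodes that fall into none of the three masks are unlabeled during training and evaluation.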
- Parameters:
raw_dir (str) – Directory where the raw input data is downloaded to, or that already contains it. Default: ~/.dgl/
force_reload (bool) – Whether to reload the dataset. Default: False.
verbose (bool) – Whether to print out progress information. Default: True.
reverse_edge (bool) – Whether to add reverse edges in the graph. Default: True.
transform (callable, optional) – A transform that takes in a DGLGraph object and returns a transformed version. The DGLGraph object will be transformed before every access.
reorder (bool) – Whether to reorder the graph using reorder_graph(). Default: False.
- num_classes
Number of label classes
- Type:
int
Notes
The node features are row-normalized.
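Row normalization usually means scaling each feature row so that its entries sum to 1, leaving all-zero rows unchanged. The pure-Python sketch below assumes that definition; row_normalize is a hypothetical helper, not a function of this module.

```python
def row_normalize(matrix):
    """Divide every row by its sum; rows that sum to zero stay as they are."""
    normalized = []
    for row in matrix:
        total = sum(row)
        if total == 0:
            normalized.append(list(row))
        else:
            normalized.append([value / total for value in row])
    return normalized


features = [[1.0, 1.0, 2.0], [0.0, 0.0, 0.0], [3.0, 0.0, 1.0]]
normalized = row_normalize(features)
print(normalized)
```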
Examples
>>> from easygraph.datasets.citation_graph import CoraGraphDataset
>>> dataset = CoraGraphDataset()
>>> g = dataset[0]
>>> num_class = dataset.num_classes
>>>
>>> # get node feature
>>> feat = g.ndata['feat']
>>>
>>> # get data split
>>> train_mask = g.ndata['train_mask']
>>> val_mask = g.ndata['val_mask']
>>> test_mask = g.ndata['test_mask']
>>>
>>> # get labels
>>> label = g.ndata['label']
- class easygraph.datasets.citation_graph.PubmedGraphDataset(raw_dir=None, force_reload=False, verbose=True, reverse_edge=True, transform=None, reorder=False)
Bases:
CitationGraphDataset
Pubmed citation network dataset.
Nodes represent scientific publications and edges represent citation relationships. Each node has a predefined feature vector with 500 dimensions. The dataset is designed for the node classification task: predicting the category of a given publication.
Statistics:
Nodes: 19717
Edges: 88651
Number of Classes: 3
Label Split:
Train: 60
Valid: 500
Test: 1000
- Parameters:
raw_dir (str) – Directory where the raw input data is downloaded to, or that already contains it. Default: ~/.dgl/
force_reload (bool) – Whether to reload the dataset. Default: False.
verbose (bool) – Whether to print out progress information. Default: True.
reverse_edge (bool) – Whether to add reverse edges in the graph. Default: True.
transform (callable, optional) – A transform that takes in a DGLGraph object and returns a transformed version. The DGLGraph object will be transformed before every access.
reorder (bool) – Whether to reorder the graph using reorder_graph(). Default: False.
- num_classes
Number of label classes
- Type:
int
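Because the raw label files store one-hot rows (see process() above), the number of classes corresponds to the width of a label row. A small sketch with hypothetical toy labels for a 3-class problem; one_hot_to_index is an illustrative helper and assumes every row contains exactly one 1.

```python
def one_hot_to_index(one_hot_rows):
    """Convert one-hot rows to integer class indices."""
    return [row.index(1) for row in one_hot_rows]


one_hot = [
    [1, 0, 0],
    [0, 0, 1],
    [0, 1, 0],
    [0, 0, 1],
]
labels = one_hot_to_index(one_hot)
num_classes = len(one_hot[0])
print(labels, num_classes)
```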
Notes
The node features are row-normalized.
Examples
>>> from easygraph.datasets.citation_graph import PubmedGraphDataset
>>> dataset = PubmedGraphDataset()
>>> g = dataset[0]
>>> num_class = dataset.num_classes
>>>
>>> # get node feature
>>> feat = g.ndata['feat']
>>>
>>> # get data split
>>> train_mask = g.ndata['train_mask']
>>> val_mask = g.ndata['val_mask']
>>> test_mask = g.ndata['test_mask']
>>>
>>> # get labels
>>> label = g.ndata['label']
- easygraph.datasets.citation_graph.load_citeseer(raw_dir=None, force_reload=False, verbose=True, reverse_edge=True, transform=None)
Get CiteseerGraphDataset
- Parameters:
raw_dir (str) – Directory where the raw input data is downloaded to, or that already contains it. Default: ~/.dgl/
force_reload (bool) – Whether to reload the dataset. Default: False.
verbose (bool) – Whether to print out progress information. Default: True.
reverse_edge (bool) – Whether to add reverse edges in the graph. Default: True.
transform (callable, optional) – A transform that takes in a DGLGraph object and returns a transformed version. The DGLGraph object will be transformed before every access.
- Return type:
CiteseerGraphDataset
- easygraph.datasets.citation_graph.load_cora(raw_dir=None, force_reload=False, verbose=True, reverse_edge=True, transform=None)
Get CoraGraphDataset
- Parameters:
raw_dir (str) – Directory where the raw input data is downloaded to, or that already contains it. Default: ~/.dgl/
force_reload (bool) – Whether to reload the dataset. Default: False.
verbose (bool) – Whether to print out progress information. Default: True.
reverse_edge (bool) – Whether to add reverse edges in the graph. Default: True.
transform (callable, optional) – A transform that takes in a DGLGraph object and returns a transformed version. The DGLGraph object will be transformed before every access.
- Return type:
CoraGraphDataset
- easygraph.datasets.citation_graph.load_pubmed(raw_dir=None, force_reload=False, verbose=True, reverse_edge=True, transform=None)
Get PubmedGraphDataset
- Parameters:
raw_dir (str) – Directory where the raw input data is downloaded to, or that already contains it. Default: ~/.dgl/
force_reload (bool) – Whether to reload the dataset. Default: False.
verbose (bool) – Whether to print out progress information. Default: True.
reverse_edge (bool) – Whether to add reverse edges in the graph. Default: True.
transform (callable, optional) – A transform that takes in a DGLGraph object and returns a transformed version. The DGLGraph object will be transformed before every access.
- Return type:
PubmedGraphDataset