easygraph.datasets.citation_graph module
Cora, CiteSeer and PubMed datasets.
- class easygraph.datasets.citation_graph.CitationGraphDataset(name, raw_dir=None, force_reload=False, verbose=True, reverse_edge=True, transform=None, reorder=False)
Bases: EasyGraphBuiltinDataset
The citation graph dataset, including Cora, CiteSeer and PubMed. Nodes represent publications and edges represent citation relationships.
- Parameters:
name (str) – Name of the dataset. Can be ‘Cora’, ‘CiteSeer’ or ‘PubMed’.
raw_dir (str) – Directory to which the raw input data is downloaded, or which already contains it. Default: ~/.dgl/
force_reload (bool) – Whether to reload the dataset. Default: False.
verbose (bool) – Whether to print out progress information. Default: True.
reverse_edge (bool) – Whether to add reverse edges to the graph. Default: True.
transform (callable, optional) – A transform that takes in a Graph object and returns a transformed version. The Graph object will be transformed before every access.
reorder (bool) – Whether to reorder the graph using reorder_graph(). Default: False.
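The effect described for reverse_edge can be illustrated without the library. The following is a minimal sketch, assuming that "adding reverse edges" means that every directed edge (u, v) also gets its counterpart (v, u); the helper name add_reverse_edges is hypothetical and is not part of EasyGraph.

```python
def add_reverse_edges(edges):
    """Return the edges plus the reverse of each one, without duplicates.

    Hypothetical helper for illustration; not EasyGraph's implementation.
    """
    seen = set()
    result = []
    for src, dst in edges:
        # Emit the edge itself, then its reverse, skipping anything already present.
        for edge in ((src, dst), (dst, src)):
            if edge not in seen:
                seen.add(edge)
                result.append(edge)
    return result


directed = [(0, 1), (1, 2), (2, 0)]
symmetric = add_reverse_edges(directed)
print(len(directed), "->", len(symmetric))
```

With three one-way citations, the symmetric list holds six edges, which is why edge counts reported for a dataset can depend on this setting.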
- Attributes:
hash
Hash value for the dataset and the setting.
name
Name of the dataset.
- num_classes
- num_labels
raw_dir
Raw file directory that contains the input data folder.
raw_path
Directory that contains the input data files.
- reverse_edge
save_dir
Directory to save the processed dataset.
- save_name
save_path
Path to save the processed dataset.
url
Get url to download the raw dataset.
verbose
Whether to print information.
Methods
download() – Automatically download data and extract it.
has_cache() – Override to implement your own logic for deciding whether a cached dataset exists.
load() – Override to implement your own logic for loading the saved dataset from files.
process() – Load input data from the data directory and reorder the graph for better locality.
save() – Override to implement your own logic for saving the processed dataset to files.
- has_cache()
Override to implement your own logic for deciding whether a cached dataset exists.
Returns False by default.
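How has_cache(), load(), process() and save() fit together can be sketched with a small stand-alone class. This is an assumed pattern, not EasyGraph's actual base class: CachedDataset, its constructor arguments, the pickle file format, and the source attribute are all hypothetical and exist only for this illustration.

```python
import os
import pickle
import tempfile


class CachedDataset:
    """Hypothetical stand-in for a dataset base class; not EasyGraph's own."""

    def __init__(self, save_path, force_reload=False):
        self.save_path = save_path
        self.data = None
        self.source = None
        # Reuse the cache unless the caller forces reprocessing.
        if not force_reload and self.has_cache():
            self.load()
        else:
            self.process()
            self.save()

    def has_cache(self):
        # Overridden hook: a cache exists if the saved file is on disk.
        return os.path.exists(self.save_path)

    def process(self):
        # Placeholder for the expensive step of building the graph.
        self.data = {"edges": [(0, 1), (1, 2)]}
        self.source = "processed"

    def save(self):
        with open(self.save_path, "wb") as f:
            pickle.dump(self.data, f)

    def load(self):
        with open(self.save_path, "rb") as f:
            self.data = pickle.load(f)
        self.source = "cache"


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_dataset.pkl")
    first = CachedDataset(path)                      # nothing cached yet
    second = CachedDataset(path)                     # served from the cache
    third = CachedDataset(path, force_reload=True)   # cache ignored

print(first.source, second.source, third.source)
```

If has_cache() always returns False, as the default is described above, the dataset is reprocessed on every construction.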
- property num_classes
- property num_labels
- process()
Loads input data from the data directory and reorders the graph for better locality.
ind.name.x => the feature vectors of the training instances as a scipy.sparse.csr.csr_matrix object;
ind.name.tx => the feature vectors of the test instances as a scipy.sparse.csr.csr_matrix object;
ind.name.allx => the feature vectors of both labeled and unlabeled training instances (a superset of ind.name.x) as a scipy.sparse.csr.csr_matrix object;
ind.name.y => the one-hot labels of the labeled training instances as a numpy.ndarray object;
ind.name.ty => the one-hot labels of the test instances as a numpy.ndarray object;
ind.name.ally => the labels for instances in ind.name.allx as a numpy.ndarray object;
ind.name.graph => a dict in the format {index: [index_of_neighbor_nodes]} as a collections.defaultdict object;
ind.name.test.index => the indices of test instances in graph, for the inductive setting, as a list object.
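The {index: [index_of_neighbor_nodes]} format of ind.name.graph can be turned into an edge list in a few lines. The sketch below uses a hand-written toy dictionary rather than a real data file, and the helper adjacency_to_edges is hypothetical; the actual loading code may handle duplicates and ordering differently.

```python
from collections import defaultdict

# Toy adjacency in the documented format; node 2 lists neighbor 3 twice
# to show that duplicate entries collapse into a single edge.
graph = defaultdict(list)
graph[0] = [1, 2]
graph[1] = [0]
graph[2] = [0, 3, 3]


def adjacency_to_edges(adjacency):
    """Flatten {node: [neighbors]} into a sorted list of unique (src, dst) pairs."""
    edges = set()
    for node, neighbors in adjacency.items():
        for neighbor in neighbors:
            edges.add((node, neighbor))
    return sorted(edges)


edges = adjacency_to_edges(graph)
print(edges)
```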
- property reverse_edge
- property save_name
- class easygraph.datasets.citation_graph.CiteseerGraphDataset(raw_dir=None, force_reload=False, verbose=True, reverse_edge=True, transform=None, reorder=False)
Bases: CitationGraphDataset
Citeseer citation network dataset.
Nodes represent scientific publications and edges represent citation relationships. Each node has a predefined feature vector with 3703 dimensions. The dataset is designed for the node classification task, which is to predict the category of a given publication.
Statistics:
Nodes: 3327
Edges: 9228
Number of Classes: 6
Label Split:
Train: 120
Valid: 500
Test: 1000
- Parameters:
raw_dir (str) – Directory to which the raw input data is downloaded, or which already contains it. Default: ~/.dgl/
force_reload (bool) – Whether to reload the dataset. Default: False.
verbose (bool) – Whether to print out progress information. Default: True.
reverse_edge (bool) – Whether to add reverse edges to the graph. Default: True.
transform (callable, optional) – A transform that takes in a DGLGraph object and returns a transformed version. The DGLGraph object will be transformed before every access.
reorder (bool) – Whether to reorder the graph using reorder_graph(). Default: False.
- num_classes
Number of label classes.
- Type:
int
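Since the raw label files are described as one-hot arrays, the number of classes corresponds to the width of a label row. A small sketch of that relationship, using made-up one-hot rows and a hypothetical helper one_hot_to_class_ids (not a library function):

```python
def one_hot_to_class_ids(one_hot_rows):
    """Map each one-hot row to the position of its largest entry."""
    return [row.index(max(row)) for row in one_hot_rows]


# Made-up labels for four nodes and three classes.
one_hot = [
    [0, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
]
class_ids = one_hot_to_class_ids(one_hot)
num_classes = len(one_hot[0])
print(class_ids, num_classes)
```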
Notes
Node features are row-normalized.
The Citeseer dataset contains some isolated nodes. These nodes are added as zero vectors at the correct positions.
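One way to read the note on isolated nodes is that some node indices are absent from the test index list, so their feature rows have to be filled in. The following sketch works under that assumption: rows are placed by index offset and any missing index receives an all-zero vector. The helper pad_missing_rows and the toy data are hypothetical.

```python
def pad_missing_rows(test_index, test_features):
    """Place each feature row at its index offset; missing indices get zero vectors."""
    lo, hi = min(test_index), max(test_index)
    dim = len(test_features[0])
    # Start from all-zero rows covering the full contiguous index range.
    padded = [[0.0] * dim for _ in range(hi - lo + 1)]
    for idx, row in zip(test_index, test_features):
        padded[idx - lo] = list(row)
    return padded


test_index = [12, 10, 13]  # index 11 is absent, i.e. an isolated node here
test_features = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
padded = pad_missing_rows(test_index, test_features)
print(padded)
```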
Examples
>>> dataset = CiteseerGraphDataset()
>>> g = dataset[0]
>>> num_class = dataset.num_classes
>>>
>>> # get node feature
>>> feat = g.ndata['feat']
>>>
>>> # get data split
>>> train_mask = g.ndata['train_mask']
>>> val_mask = g.ndata['val_mask']
>>> test_mask = g.ndata['test_mask']
>>>
>>> # get labels
>>> label = g.ndata['label']
- Attributes:
hash
Hash value for the dataset and the setting.
name
Name of the dataset.
- num_classes
- num_labels
raw_dir
Raw file directory that contains the input data folder.
raw_path
Directory that contains the input data files.
- reverse_edge
save_dir
Directory to save the processed dataset.
- save_name
save_path
Path to save the processed dataset.
url
Get url to download the raw dataset.
verbose
Whether to print information.
Methods
download() – Automatically download data and extract it.
has_cache() – Override to implement your own logic for deciding whether a cached dataset exists.
load() – Override to implement your own logic for loading the saved dataset from files.
process() – Load input data from the data directory and reorder the graph for better locality.
save() – Override to implement your own logic for saving the processed dataset to files.
- class easygraph.datasets.citation_graph.CoraBinary(raw_dir=None, force_reload=False, verbose=True, transform=None)
Bases: EasyGraphBuiltinDataset
A mini-dataset for a binary classification task using Cora.
After loading, it has the following members:
graphs : list of DGLGraph
pmpds : list of scipy.sparse.coo_matrix
labels : list of numpy.ndarray
- Parameters:
raw_dir (str) – Directory to which the raw input data is downloaded, or which already contains it. Default: ~/.dgl/
force_reload (bool) – Whether to reload the dataset. Default: False.
verbose (bool) – Whether to print out progress information. Default: True.
transform (callable, optional) – A transform that takes in a DGLGraph object and returns a transformed version. The DGLGraph object will be transformed before every access.
- Attributes:
hash
Hash value for the dataset and the setting.
name
Name of the dataset.
raw_dir
Raw file directory that contains the input data folder.
raw_path
Directory that contains the input data files.
save_dir
Directory to save the processed dataset.
- save_name
save_path
Path to save the processed dataset.
url
Get url to download the raw dataset.
verbose
Whether to print information.
Methods
download() – Automatically download data and extract it.
has_cache() – Override to implement your own logic for deciding whether a cached dataset exists.
load() – Override to implement your own logic for loading the saved dataset from files.
process() – Override to implement your own logic for processing the input data.
save() – Override to implement your own logic for saving the processed dataset to files.
- has_cache()
Override to implement your own logic for deciding whether a cached dataset exists.
Returns False by default.
- property save_name
- class easygraph.datasets.citation_graph.CoraGraphDataset(raw_dir=None, force_reload=False, verbose=True, reverse_edge=True, transform=None, reorder=False)
Bases: CitationGraphDataset
Cora citation network dataset.
Nodes represent papers and edges represent citation relationships. Each node has a predefined feature vector with 1433 dimensions. The dataset is designed for the node classification task, which is to predict the category of a given paper.
Statistics:
Nodes: 2708
Edges: 10556
Number of Classes: 7
Label split:
Train: 140
Valid: 500
Test: 1000
- Parameters:
raw_dir (str) – Directory to which the raw input data is downloaded, or which already contains it. Default: ~/.dgl/
force_reload (bool) – Whether to reload the dataset. Default: False.
verbose (bool) – Whether to print out progress information. Default: True.
reverse_edge (bool) – Whether to add reverse edges to the graph. Default: True.
transform (callable, optional) – A transform that takes in a DGLGraph object and returns a transformed version. The DGLGraph object will be transformed before every access.
reorder (bool) – Whether to reorder the graph using reorder_graph(). Default: False.
- num_classes
Number of label classes.
- Type:
int
Notes
Node features are row-normalized.
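Row normalization is commonly taken to mean dividing each feature row by its sum so that every non-empty row adds up to 1. The sketch below assumes that definition (the note above does not state which norm is used) and leaves all-zero rows unchanged to avoid division by zero; row_normalize is a hypothetical helper.

```python
def row_normalize(features):
    """Divide each row by its sum; all-zero rows stay all-zero."""
    normalized = []
    for row in features:
        total = sum(row)
        if total == 0:
            normalized.append([0.0] * len(row))
        else:
            normalized.append([value / total for value in row])
    return normalized


features = [
    [1, 1, 0, 2],
    [0, 0, 0, 0],
    [0, 3, 0, 0],
]
normalized = row_normalize(features)
print(normalized)
```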
Examples
>>> dataset = CoraGraphDataset()
>>> g = dataset[0]
>>> num_class = dataset.num_classes
>>>
>>> # get node feature
>>> feat = g.ndata['feat']
>>>
>>> # get data split
>>> train_mask = g.ndata['train_mask']
>>> val_mask = g.ndata['val_mask']
>>> test_mask = g.ndata['test_mask']
>>>
>>> # get labels
>>> label = g.ndata['label']
- Attributes:
hash
Hash value for the dataset and the setting.
name
Name of the dataset.
- num_classes
- num_labels
raw_dir
Raw file directory that contains the input data folder.
raw_path
Directory that contains the input data files.
- reverse_edge
save_dir
Directory to save the processed dataset.
- save_name
save_path
Path to save the processed dataset.
url
Get url to download the raw dataset.
verbose
Whether to print information.
Methods
download() – Automatically download data and extract it.
has_cache() – Override to implement your own logic for deciding whether a cached dataset exists.
load() – Override to implement your own logic for loading the saved dataset from files.
process() – Load input data from the data directory and reorder the graph for better locality.
save() – Override to implement your own logic for saving the processed dataset to files.
- class easygraph.datasets.citation_graph.PubmedGraphDataset(raw_dir=None, force_reload=False, verbose=True, reverse_edge=True, transform=None, reorder=False)
Bases: CitationGraphDataset
Pubmed citation network dataset.
Nodes represent scientific publications and edges represent citation relationships. Each node has a predefined feature vector with 500 dimensions. The dataset is designed for the node classification task, which is to predict the category of a given publication.
Statistics:
Nodes: 19717
Edges: 88651
Number of Classes: 3
Label Split:
Train: 60
Valid: 500
Test: 1000
- Parameters:
raw_dir (str) – Directory to which the raw input data is downloaded, or which already contains it. Default: ~/.dgl/
force_reload (bool) – Whether to reload the dataset. Default: False.
verbose (bool) – Whether to print out progress information. Default: True.
reverse_edge (bool) – Whether to add reverse edges to the graph. Default: True.
transform (callable, optional) – A transform that takes in a DGLGraph object and returns a transformed version. The DGLGraph object will be transformed before every access.
reorder (bool) – Whether to reorder the graph using reorder_graph(). Default: False.
- num_classes
Number of label classes.
- Type:
int
Notes
Node features are row-normalized.
Examples
>>> dataset = PubmedGraphDataset()
>>> g = dataset[0]
>>> num_class = dataset.num_classes
>>>
>>> # get node feature
>>> feat = g.ndata['feat']
>>>
>>> # get data split
>>> train_mask = g.ndata['train_mask']
>>> val_mask = g.ndata['val_mask']
>>> test_mask = g.ndata['test_mask']
>>>
>>> # get labels
>>> label = g.ndata['label']
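The train, validation and test masks in the example above are per-node boolean vectors. A library-free sketch of how such masks select labels, using a toy 10-node graph; the split indices and labels here are made up for illustration, and make_mask is a hypothetical helper.

```python
def make_mask(indices, size):
    """Build a boolean list of length `size` that is True at the given indices."""
    mask = [False] * size
    for i in indices:
        mask[i] = True
    return mask


num_nodes = 10
labels = [0, 1, 2, 0, 1, 2, 0, 1, 2, 0]

# Made-up split positions; nodes 5 and 6 belong to no split.
train_mask = make_mask([0, 1], num_nodes)
val_mask = make_mask([2, 3, 4], num_nodes)
test_mask = make_mask([7, 8, 9], num_nodes)

# Select the labels of the nodes in each split.
train_labels = [y for y, keep in zip(labels, train_mask) if keep]
test_labels = [y for y, keep in zip(labels, test_mask) if keep]
print(sum(train_mask), sum(val_mask), sum(test_mask))
```

The number of True entries in each mask corresponds to the split sizes listed under Label Split, and nodes outside all three masks are unlabeled for training purposes.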
- Attributes:
hash
Hash value for the dataset and the setting.
name
Name of the dataset.
- num_classes
- num_labels
raw_dir
Raw file directory that contains the input data folder.
raw_path
Directory that contains the input data files.
- reverse_edge
save_dir
Directory to save the processed dataset.
- save_name
save_path
Path to save the processed dataset.
url
Get url to download the raw dataset.
verbose
Whether to print information.
Methods
download() – Automatically download data and extract it.
has_cache() – Override to implement your own logic for deciding whether a cached dataset exists.
load() – Override to implement your own logic for loading the saved dataset from files.
process() – Load input data from the data directory and reorder the graph for better locality.
save() – Override to implement your own logic for saving the processed dataset to files.
- easygraph.datasets.citation_graph.load_citeseer(raw_dir=None, force_reload=False, verbose=True, reverse_edge=True, transform=None)
Get CiteseerGraphDataset.
- Parameters:
raw_dir (str) – Directory to which the raw input data is downloaded, or which already contains it. Default: ~/.dgl/
force_reload (bool) – Whether to reload the dataset. Default: False.
verbose (bool) – Whether to print out progress information. Default: True.
reverse_edge (bool) – Whether to add reverse edges to the graph. Default: True.
transform (callable, optional) – A transform that takes in a DGLGraph object and returns a transformed version. The DGLGraph object will be transformed before every access.
- Return type:
CiteseerGraphDataset
- easygraph.datasets.citation_graph.load_cora(raw_dir=None, force_reload=False, verbose=True, reverse_edge=True, transform=None)
Get CoraGraphDataset.
- Parameters:
raw_dir (str) – Directory to which the raw input data is downloaded, or which already contains it. Default: ~/.dgl/
force_reload (bool) – Whether to reload the dataset. Default: False.
verbose (bool) – Whether to print out progress information. Default: True.
reverse_edge (bool) – Whether to add reverse edges to the graph. Default: True.
transform (callable, optional) – A transform that takes in a DGLGraph object and returns a transformed version. The DGLGraph object will be transformed before every access.
- Return type:
CoraGraphDataset
- easygraph.datasets.citation_graph.load_pubmed(raw_dir=None, force_reload=False, verbose=True, reverse_edge=True, transform=None)
Get PubmedGraphDataset.
- Parameters:
raw_dir (str) – Directory to which the raw input data is downloaded, or which already contains it. Default: ~/.dgl/
force_reload (bool) – Whether to reload the dataset. Default: False.
verbose (bool) – Whether to print out progress information. Default: True.
reverse_edge (bool) – Whether to add reverse edges to the graph. Default: True.
transform (callable, optional) – A transform that takes in a DGLGraph object and returns a transformed version. The DGLGraph object will be transformed before every access.
- Return type:
PubmedGraphDataset