Clustering
1. To generate bibliographic coupling research fronts, it is first necessary to cluster the papers using bibliographic coupling.
2. On the main TFGUI window, select cooccurrence > Use bib coupling default.
3. The coocur_gui window will appear. The parameters in the window are described below:
- Primary entity will be paper, and Relative entity will be reference. This means that papers will be clustered by common references (bibliographic coupling).
- Occurrence threshold is used for clustering cited entities (references, reference authors, reference journals). It should be set to 0 in this case.
- Cooccurrence threshold is used to exclude entities that are not strongly linked to any other entity. For bibliographic coupling, a value of 5 usually works well: any paper that does not share at least 5 references with some other paper is excluded. Generally, this reduces the number of papers to 60% to 75% of the original count. A cooccurrence threshold of 3 or 4 can be used to cluster small datasets (<500 papers), or datasets taken from poorly linked fields such as those found in the social sciences. A cooccurrence threshold of less than 3 usually produces poor clustering.
- Similarity threshold excludes entities whose similarity to every other entity falls below the threshold. For the bibliographic coupling clustering shown here, the cooccurrence threshold usually does a good job of excluding papers that aren't well linked to other papers. Normally, the similarity threshold is not used and is simply set to 0.
- Number of clusters is a user-supplied parameter. A good rule of thumb is to use about 10 clusters for every 500 papers. For the prion example, with about 800 papers, we will generate 15 clusters.
- Similarity method defaults to 'dice' similarity. Pearson's rxy is also available but is not often used.
- Overlap is useful for clustering primary entities that can occur many times with a secondary entity, for example when clustering reference authors by common paper (author cocitation analysis) or clustering paper journals by common reference journals. Run time for calculating overlap can be very long, so don't use it for clustering more than a couple of hundred entities. For bibliographic coupling, don't use overlap: it gives the same results as the default of matrix multiplication and takes far longer to calculate.
- Keepers are used to keep certain papers on the map that don't satisfy the cooccurrence criterion. They are sometimes used to display key papers that occur early in the development of a specialty, which may not have many bibliographic coupling links to other papers in the dataset. Keepers can be generated by giving a times-cited criterion, that is, by keeping papers that have been cited more often than the citation threshold. Keepers usually don't map well because of their lack of bibliographic coupling, so this feature is usually not used.
- The output variable is a MATLAB structure variable that will hold the results of clustering. The cluster structure variable project.bib is the default, and is expected by many of the timeline programs. Always use the default project.bib unless you have some specific reason to change it.
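To make the cooccurrence threshold and Dice similarity concrete, here is a minimal Python sketch, not the toolbox's own code. It assumes the standard set-based Dice coefficient, 2|A∩B| / (|A| + |B|), which may differ in detail from what the toolbox computes. The paper keys, reference keys, and function names are hypothetical.

```python
from itertools import combinations

# Hypothetical toy data: paper key -> set of cited reference keys.
papers = {
    "P1": {"R1", "R2", "R3", "R4", "R5", "R6"},
    "P2": {"R1", "R2", "R3", "R4", "R5", "R7"},
    "P3": {"R1", "R2", "R8"},
    "P4": {"R9"},
}


def coupling_counts(papers):
    """Count the references shared by every pair of papers."""
    return {
        (a, b): len(papers[a] & papers[b])
        for a, b in combinations(sorted(papers), 2)
    }


def apply_cooccurrence_threshold(papers, threshold):
    """Keep papers that share at least `threshold` references with some other paper."""
    kept = set()
    for (a, b), shared in coupling_counts(papers).items():
        if shared >= threshold:
            kept.update((a, b))
    return sorted(kept)


def dice(refs_a, refs_b):
    """Standard Dice coefficient on two reference sets (assumed definition)."""
    total = len(refs_a) + len(refs_b)
    return 2 * len(refs_a & refs_b) / total if total else 0.0


# With a threshold of 5, only P1 and P2 (5 shared references) survive.
kept = apply_cooccurrence_threshold(papers, threshold=5)

# Pairwise similarity among the surviving papers.
similarity = {
    (a, b): dice(papers[a], papers[b]) for a, b in combinations(kept, 2)
}

print(kept)        # ['P1', 'P2']
print(similarity)  # P1/P2 similarity is 10/12, about 0.83
```

In this toy set, P3 shares only 2 references with any other paper and P4 shares none, so both are dropped, in the same way that weakly linked papers are excluded from the real dataset.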
4. Select Execute to proceed with clustering. You may get a dialog box asking "Do you want to overwrite project.bib?" This is to prevent inadvertent overwriting of previous results. Select Yes to proceed with clustering, or No to cancel clustering.
5. Some procedural messages will appear in the MATLAB command window, followed by a dialog box giving the number of items that will be clustered. Select Yes on this dialog box to proceed. This dialog box is normally used during cocitation clustering to iteratively vary the occurrence threshold to get an appropriate number of items clustered. The sample dialog box below is from the prion dataset, using a cooccurrence threshold of 5. Note that the number of papers that will be clustered drops to about 75% of the original number of papers in the dataset (804). This is typical of collections of papers from biomedical fields. Discarded papers are typically non-relevant papers that were somehow captured in the original query from WOS, or non-journal items such as letters and book reviews.
6. After clustering, a dendrogram seriation program will run to order the clustering dendrogram leaves for best visualization. For a large number of clusters, seriation may take 30 minutes to an hour. The wait bar below will display as seriation proceeds.
7. The clustering results are stored in the MATLAB variable project.bib.
- project.bib.call contains a list of the parameters used for clustering, for auditing purposes.
- project.bib.z is the clustering structure (binary tree) produced by MATLAB's linkage routine, which performs agglomerative hierarchical clustering.
- project.bib.cluster lists the cluster number for each paper that was clustered.
- project.bib.members are the paper keys of the papers that were clustered.
- project.bib.sim is the similarity matrix for the papers that were clustered.
- project.bib.z1 is a reduced binary clustering structure that is cut off at the number of clusters that were selected.
- project.bib.order is the dendrogram order of the clusters, the result of the seriation routine.
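As an illustration of how the cluster and members fields relate, the following Python sketch groups paper keys by cluster number. It assumes the two fields are parallel lists (entry i of cluster is the cluster number of entry i of members), which this page does not state explicitly. The sample values and the function name are hypothetical stand-ins for data exported from MATLAB.

```python
from collections import defaultdict

# Hypothetical exported values standing in for project.bib.members
# and project.bib.cluster; assumed to be in the same order.
members = [101, 102, 103, 104, 105]
cluster = [1, 2, 1, 3, 2]


def group_by_cluster(members, cluster):
    """Return a dict mapping cluster number -> list of paper keys."""
    if len(members) != len(cluster):
        raise ValueError("members and cluster must have the same length")
    fronts = defaultdict(list)
    for paper_key, cluster_number in zip(members, cluster):
        fronts[cluster_number].append(paper_key)
    return dict(fronts)


fronts = group_by_cluster(members, cluster)
for cluster_number in sorted(fronts):
    print(cluster_number, fronts[cluster_number])
```

Each resulting group corresponds to one candidate research front, which is the unit the timeline step works with.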
8. After using bibliographic coupling clustering to cluster the papers into research fronts, the next step is to produce a research front timeline.