Catching Zika Fever: Application of Crowdsourcing and Machine Learning for Tracking Health Misinformation on Twitter - Medical Lexicon Dataset

Amira Ghenai

University of Waterloo

This site provides the medical lexicon generated in the work presented in "Catching Zika Fever: Application of Crowdsourcing and Machine Learning for Tracking Health Misinformation on Twitter".

See also: The full paper is available here.

We compute the medical lexicon of `infectious disease' Wikipedia pages using two different corpora: a medical corpus and a general Wikipedia corpus.

For every word in each corpus, we compute its probability as follows (a short code sketch follows the list):
  1. In the medical corpus, the probability of a word w is mp_w (medical-corpus probability of w) = frequency of w in the medical corpus / total number of words in the medical corpus.
  2. In the Wikipedia corpus, the probability of a word w is wp_w = frequency of w in the Wikipedia corpus / total number of words in the Wikipedia corpus.
  3. Then, for every word w appearing in both corpora, we compute the score p_w = mp_w / wp_w.
  4. Finally, we sort words by p_w in descending order and pick the top words to form the lexicon.
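
A minimal sketch of this procedure in Python, assuming the per-corpus word counts have already been extracted into dictionaries. The names `build_lexicon`, `medical_counts`, and `wikipedia_counts` are illustrative and not part of the released code; the ratio score p_w follows the steps above.

```python
def build_lexicon(medical_counts, wikipedia_counts, top_k=22123):
    """Rank words by how much more probable they are in the medical corpus.

    medical_counts / wikipedia_counts: dicts mapping word -> raw frequency.
    Returns the top_k words sorted by p_w = mp_w / wp_w in descending order.
    """
    medical_total = sum(medical_counts.values())
    wikipedia_total = sum(wikipedia_counts.values())

    scores = {}
    for word, freq in medical_counts.items():
        if word not in wikipedia_counts:
            continue  # step 3 considers only words present in both corpora
        mp_w = freq / medical_total                      # medical-corpus probability
        wp_w = wikipedia_counts[word] / wikipedia_total  # Wikipedia probability
        scores[word] = mp_w / wp_w                       # score p_w

    # Step 4: sort by p_w in descending order and keep the top words.
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```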

The set of medical and Wikipedia lexicons may be downloaded here:
Both the medical_corpus.txt and the wikipedia_corpus.txt files contain 22123 words. Each word is on a separate line, and each line has the format WORD [TAB] FREQUENCY.
For more details, please read the paper.
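
For reference, a short sketch for parsing these files, assuming the FREQUENCY column holds integer counts (the `load_lexicon` helper below is illustrative, not part of the release):

```python
def load_lexicon(path):
    """Read a lexicon file where each line is 'WORD<TAB>FREQUENCY'."""
    lexicon = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, frequency = line.rstrip("\n").split("\t")
            lexicon[word] = int(frequency)  # assumes integer counts
    return lexicon

medical = load_lexicon("medical_corpus.txt")      # 22123 entries
wikipedia = load_lexicon("wikipedia_corpus.txt")  # 22123 entries
```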

Please cite the original publication when using the dataset:
Amira Ghenai and Yelena Mejova. 2017, January. Catching Zika Fever: Application of Crowdsourcing and Machine Learning for Tracking Health Misinformation on Twitter. In The Fifth IEEE International Conference on Healthcare Informatics (ICHI 2017), Park City, Utah.