Catching Zika Fever: Application of Crowdsourcing and Machine Learning for Tracking Health Misinformation on Twitter - Medical lexicon Dataset
Amira Ghenai
University of Waterloo
This site provides the medical lexicon generated in the work presented in "Catching Zika Fever: Application of Crowdsourcing and Machine Learning for Tracking Health Misinformation on Twitter".
See also: The full paper is available here.
We compute the medical lexicon of `infectious disease' Wikipedia pages using two different corpora:
- Medical corpus (all Wikipedia pages about infectious diseases)
- Wikipedia corpus (the top words across all of Wikipedia)
For every word w, we compute the probabilities as follows:
- In the medical corpus, we compute the probability of every word as:
mp_w (medical corpus probability of w) = frequency of w in the medical corpus / total number of words in the medical corpus.
- In the Wikipedia corpus, we compute the probability of every word as:
wp_w (Wikipedia corpus probability of w) = frequency of w in Wikipedia / total number of Wikipedia words.
- Then, for every word w in either corpus:
  - If w is in the medical corpus but not in the Wikipedia corpus, or vice versa, p_w = 0.
  - If w is in both the medical and the Wikipedia corpus, p_w = mp_w - wp_w (medical corpus probability minus Wikipedia corpus probability).
- Finally, we sort the words by p_w in descending order and pick the top words to form the lexicon (a sketch of this procedure in Python follows below).
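For illustration only, the following Python sketch mirrors the procedure above. It assumes the two corpora have already been tokenized into plain word lists; the function name, inputs, and top_k parameter are hypothetical and are not part of the released code or data.

    from collections import Counter

    def build_medical_lexicon(medical_words, wikipedia_words, top_k):
        """Rank words by how much more probable they are in the medical corpus."""
        med_counts, wiki_counts = Counter(medical_words), Counter(wikipedia_words)
        med_total, wiki_total = sum(med_counts.values()), sum(wiki_counts.values())

        scores = {}
        for w in set(med_counts) | set(wiki_counts):
            if w in med_counts and w in wiki_counts:
                mp_w = med_counts[w] / med_total    # medical corpus probability
                wp_w = wiki_counts[w] / wiki_total  # Wikipedia corpus probability
                scores[w] = mp_w - wp_w
            else:
                scores[w] = 0.0                     # word occurs in only one corpus

        # Sort by p_w in descending order and keep the top_k words.
        return sorted(scores, key=scores.get, reverse=True)[:top_k]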
The set of medical and Wikipedia lexicons may be downloaded here:
Both the medical_corpus.txt and the wikipedia_corpus.txt files contain 22,123 words. Each word is on a separate line, and each line has the format: WORD [TAB] FREQUENCY.
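As an example, a file in this format can be parsed into a word-to-frequency dictionary along these lines (the file names follow the description above; the helper itself is illustrative, not part of the release):

    def load_frequencies(path):
        """Read one WORD [TAB] FREQUENCY file into a {word: frequency} dictionary."""
        freqs = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                word, freq = line.rstrip("\n").split("\t")
                freqs[word] = int(freq)
        return freqs

    medical = load_frequencies("medical_corpus.txt")
    wikipedia = load_frequencies("wikipedia_corpus.txt")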
For more details, please read the paper.
Please cite the original publication when using the dataset:
Amira Ghenai and Yelena Mejova. 2017. Catching Zika Fever: Application of Crowdsourcing and Machine Learning for Tracking Health Misinformation on Twitter. In Proceedings of the Fifth IEEE International Conference on Healthcare Informatics (ICHI 2017), Park City, Utah.