NLP for COVID-19 data extraction on D-Wave


I am a physicist from the University of Texas at Arlington, currently working with the CoronaWhy collaboration on analysis of CORD-19, the COVID-19 Open Research Dataset. The international CoronaWhy consortium comprises some 500 professionals and enthusiasts in data science and related disciplines, working on specific natural language processing and machine learning challenges: parsing and extracting structured data from roughly 4 GB of full text.

We are exploring ways we might apply D-Wave machines to improve performance on some of these tasks. If there is any good literature on the use of D-Wave or similar hardware for NLP, that would be very helpful. Please get in touch if you have ideas.

Ben Jones



  • Hi Ben,

    This is a little outside of my area of expertise. I'm wondering if a member of the Leap community with more relevant NLP experience might chime in.

    The techniques I'm familiar with in the ML space may be too general or theoretical for what you need. If you'd like a quick review of ML research that's been done with D-Wave quantum computers, check out the applications page, where you can filter the entries by Application Types -> Machine Learning. You'll find work there on discrete quadratic loss functions, training Boltzmann machines, and discrete variational autoencoders.
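    To make the "discrete quadratic loss" idea concrete, here is a toy illustration (my own, not from any of the linked applications): a loss over binary variables in the QUBO form that D-Wave samplers minimize. The biases and couplings are made up for the demo, and brute-force enumeration stands in for the annealer:

    ```python
    from itertools import product

    # Toy QUBO: E(x) = sum_i h_i x_i + sum_{i<j} J_ij x_i x_j, x_i in {0, 1}.
    # On a real problem these dicts would be handed to a D-Wave sampler;
    # here we simply enumerate all 2^3 assignments.
    h = {0: -1.0, 1: -1.0, 2: 2.0}   # linear biases (invented for the demo)
    J = {(0, 1): -2.0, (1, 2): 1.0}  # quadratic couplings (invented)

    def energy(x):
        e = sum(h[i] * x[i] for i in h)
        e += sum(Jij * x[i] * x[j] for (i, j), Jij in J.items())
        return e

    best = min(product([0, 1], repeat=3), key=energy)
    print(best, energy(best))  # (1, 1, 0) -4.0
    ```

    The same dictionaries map directly onto the linear/quadratic arguments of a binary quadratic model in the Ocean tools, so the enumeration step is the only part that wouldn't survive contact with a full-sized problem.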

    I wish you luck!

  • Also, if you go to the searchable code examples page, you can filter them by Industry -> Machine Learning & AI to see open-source code examples that you can open in the IDE to review, edit, and test.

  • Hello Ben,

    I see someone from the CoronaWhy project is experimenting with CoreNLP from Stanford. I think it makes a lot of sense to adopt a mature package like that, especially if the actual goal is the extraction of meaning. Deconstructing human language is hard enough!

    Below is some experimentation with a sibling package from Stanford, Stanza. Installation is trivial:

    $ pip install stanza

    $ python
    Python 3.7.5 (default, Nov 7 2019, 10:50:52)
    [GCC 8.3.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import stanza

    >>> stanza.download('en')  # first run only: fetches the English models

    >>> nlp = stanza.Pipeline('en')

    # Let's parse some scientific text.

    # I noticed that my GPU spins-up when this command is run.  Pretty cool!

    >>> doc = nlp("Angiotensin-converting enzyme 2 (ACE2) is the cellular receptor for severe acute respiratory syndrome–coronavirus (SARS-CoV) and the new coronavirus (SARS-CoV-2) that is causing the serious coronavirus disease 2019 (COVID-19) epidemic. Here, we present cryo–electron microscopy structures of full-length human ACE2 in the presence of the neutral amino acid transporter B0AT1 with or without the receptor binding domain (RBD) of the surface spike glycoprotein (S protein) of SARS-CoV-2, both at an overall resolution of 2.9 angstroms, with a local resolution of 3.5 angstroms at the ACE2-RBD interface. The ACE2-B0AT1 complex is assembled as a dimer of heterodimers, with the collectrin-like domain of ACE2 mediating homodimerization. The RBD is recognized by the extracellular peptidase domain of ACE2 mainly through polar residues. These findings provide important insights into the molecular basis for coronavirus recognition and infection.")

    # Now we can get a graph of the text.  (Trimmed output.)

    >>> doc.sentences[0].print_dependencies()
    ('Angiotensin', '12', 'nsubj')
    ('-', '1', 'punct')
    ('converting', '1', 'acl')
    ('enzyme', '3', 'obj')
    ('2', '4', 'nummod')
    ('(', '7', 'punct')
    ('ACE2', '4', 'appos')
    (')', '7', 'punct')
    ('is', '12', 'cop')
    ('the', '12', 'det')
    ('cellular', '12', 'amod')
    ('receptor', '0', 'root')
    ('for', '17', 'case')
    ('severe', '17', 'amod')
    ('acute', '17', 'amod')
    ('respiratory', '17', 'amod')
    ('syndrome', '12', 'nmod')
    ('–', '17', 'punct')
    ('coronavirus', '17', 'appos')
    ('(', '21', 'punct')
    ('SARS', '19', 'appos')
    ('-', '21', 'punct')
    ('CoV', '21', 'appos')
    (')', '21', 'punct')
    ('and', '28', 'cc')
    ('the', '28', 'det')
    ('new', '28', 'amod')
    ('coronavirus', '19', 'conj')
    ('(', '30', 'punct')
    ('SARS', '28', 'appos')
    ('-', '30', 'punct')
    ('CoV', '30', 'appos')
    ('-', '30', 'punct')
    ('2', '32', 'appos')
    (')', '30', 'punct')
    ('that', '38', 'nsubj')
    ('is', '38', 'aux')
    ('causing', '28', 'acl:relcl')
    ('the', '49', 'det')
    ('serious', '49', 'amod')
    ('coronavirus', '42', 'compound')

    # [etc]
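    In case it's useful downstream, those print_dependencies() tuples — (word, head index, relation), with head '0' marking the root — collect into a plain adjacency structure with nothing beyond the standard library. The sample tuples below are copied from the output above:

    ```python
    # A few (word, head_index, relation) tuples from the parse above.
    deps = [
        ('Angiotensin', '12', 'nsubj'),
        ('receptor', '0', 'root'),
        ('the', '12', 'det'),
        ('cellular', '12', 'amod'),
    ]

    # Map each head index to its list of (dependent word, relation) edges.
    edges = {}
    for word, head, rel in deps:
        edges.setdefault(head, []).append((word, rel))

    print(edges['12'])  # dependents attached to token 12 ("receptor")
    ```

    From a structure like this it's a short step to a proper graph library, or to counting relation types across the whole corpus.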

    Stanford has a ton of papers on related topics, including reference material like this for the dependency information. I suspect none of this is news to you, but I thought I'd throw it out first to see where your interests lie.

    I haven't yet thought about how to transform a graph like that into a BQM (binary quadratic model). It seems clear there would be many nodes and many types of nodes. Given the scale, it would definitely be a job for our hybrid solver service.
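    As a thought experiment only — this is my own guess at one possible mapping, not an established method — one could treat each dependency-graph node as a binary variable (1 = keep the token as a key term), reward informative tokens with a negative linear bias, and penalise keeping both endpoints of an edge to reduce redundancy. All the weights below are invented, and brute force stands in for the hybrid solvers:

    ```python
    from itertools import product

    # Invented biases: negative = worth keeping, positive = discourage.
    nodes = {'receptor': -2.0, 'ACE2': -3.0, 'cellular': -1.0, 'the': 1.0}
    # Invented edge penalties between dependency-linked tokens.
    edges = {('receptor', 'cellular'): 1.5, ('receptor', 'ACE2'): 0.5}

    names = list(nodes)

    def energy(bits):
        x = dict(zip(names, bits))
        e = sum(nodes[n] * x[n] for n in names)
        e += sum(w * x[u] * x[v] for (u, v), w in edges.items())
        return e

    best = min(product([0, 1], repeat=len(names)), key=energy)
    selected = {n for n, b in zip(names, best) if b}
    print(selected)  # the minimum-energy subset of tokens
    ```

    At corpus scale the nodes/edges dicts would become the linear and quadratic terms of a BQM submitted to the hybrid solvers rather than enumerated locally.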

