Introduction
Problem
Accessing a database to get specific information can be a stressful and time-consuming task for anyone who lacks the necessary technical skills.
Solution
- A knowledge-based QA system for the medical domain [1] that answers queries about the sales and marketing of drugs from a given database.
- Real-time responses to the common business queries of any employee.
- Quick access and easy use through a messaging service such as Facebook Messenger or Slack.
Methodology
Fig. 1 presents the basic architecture of my QA system. First, the user submits a query to the QA Engine via the Slack interface. The submitted query is pre-processed and sent to the NLP Engine, which extracts keywords or named entities [2] from it. The extracted entities are then forwarded to the Data Engine to retrieve the results. Finally, the retrieved results are sent back to the Slack interface and displayed to the user.
1. User Query
A user can submit multiple queries to the QA system via Slack, one at a time. The Slacker [3] library was used to connect Python to Slack. Table 1 shows some of the business questions a user can ask, along with the functionality each one tests. Similarly, Table 2 shows some of the technical questions and their predefined responses.
2. Query pre-processing
The pre-processing of the user query was done in three steps: Noise Removal, Lowercasing, and Normalization.
- During Noise Removal, punctuation was removed from the query.
- During Lowercasing, all words in the query were converted to lowercase.
- During Normalization, the terms in the query were transformed into the standard form used in the Data Engine. Table 3 shows some examples.
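The three pre-processing steps can be sketched in Python as follows. This is a minimal illustration, not the project's actual code: the `preprocess` function and the entries in `NORMALIZATION_MAP` are hypothetical, since the real term mappings are the ones listed in Table 3.

```python
import re
import string

# Hypothetical mapping from user-typed variants to the standard form
# assumed to be stored in the Data Engine (the real mapping is in Table 3).
NORMALIZATION_MAP = {
    "q1": "quarter 1",
    "first quarter": "quarter 1",
    "usa": "united states",
}


def preprocess(query: str) -> str:
    """Apply noise removal, lowercasing, and normalization to a query."""
    # Noise removal: strip every ASCII punctuation character.
    cleaned = query.translate(str.maketrans("", "", string.punctuation))
    # Lowercasing.
    cleaned = cleaned.lower()
    # Collapse repeated whitespace left behind by the removed punctuation.
    cleaned = " ".join(cleaned.split())
    # Normalization: replace known variants (whole words only) with
    # their standard form.
    for variant, standard in NORMALIZATION_MAP.items():
        cleaned = re.sub(rf"\b{re.escape(variant)}\b", standard, cleaned)
    return cleaned


print(preprocess("Sales of DrugX in USA for Q1, 2018?"))
```

Matching on word boundaries keeps a variant such as "usa" from being replaced inside a longer word.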
3. Named Entity Recognition (NER)
I trained a blank spaCy [4] model on annotated training data to recognize four custom entities in the user query: product, country/region, quarter, and year. Fig. 2 presents a black-box view of the model.
The model was trained on 92 sentences. After training for 20 epochs with a dropout of 0.2 and the SGD optimizer, the model reached a training loss of 2.06–8.
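To illustrate what annotated training data for a custom NER model can look like, the sketch below builds one example in the character-offset format `(text, {"entities": [(start, end, label)]})` that spaCy's training examples use. The `annotate` helper, the sample sentence, and the label names `PRODUCT`, `COUNTRY`, `QUARTER`, and `YEAR` are assumptions for illustration; the labels actually used in the project are not given here.

```python
def annotate(text, spans):
    """Build one training example from (phrase, label) pairs.

    Each phrase is located in the text, and its character offsets are
    recorded as a (start, end, label) tuple with an exclusive end.
    """
    entities = []
    for phrase, label in spans:
        start = text.find(phrase)  # first occurrence only
        if start == -1:
            raise ValueError(f"{phrase!r} not found in {text!r}")
        entities.append((start, start + len(phrase), label))
    return (text, {"entities": entities})


# Hypothetical sentence and labels, for illustration only.
example = annotate(
    "sales of drugx in india for quarter 1 2018",
    [
        ("drugx", "PRODUCT"),
        ("india", "COUNTRY"),
        ("quarter 1", "QUARTER"),
        ("2018", "YEAR"),
    ],
)

text, annotations = example
for start, end, label in annotations["entities"]:
    print(label, "->", text[start:end])
```

Computing the offsets from the phrases, rather than typing them by hand, avoids the off-by-one errors that misaligned annotations tend to cause during training.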
4. Spell check
Next, I applied the edit distance [5] technique to correct spelling errors in the extracted entities.
For each entity, the entry in the Data Engine column with the corresponding entity tag that has the minimum edit distance to the entity is taken as its correct spelling. The entity is replaced only if this minimum edit distance is less than 2.
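The spell-check rule above can be sketched as follows, assuming the standard Levenshtein distance (insertions, deletions, and substitutions each cost 1); the exact variant used in the project is not stated. The function names and the candidate list are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def correct_entity(entity: str, candidates: list, max_distance: int = 2) -> str:
    """Return the closest candidate if its distance is below max_distance.

    Otherwise the entity is returned unchanged.
    """
    if not candidates:
        return entity
    best = min(candidates, key=lambda c: edit_distance(entity, c))
    if edit_distance(entity, best) < max_distance:
        return best
    return entity


# Hypothetical entries from the Data Engine's country column.
countries = ["india", "china", "japan"]
print(correct_entity("inda", countries))   # one missing letter: distance 1
print(correct_entity("indai", countries))  # swapped letters: distance 2
```

With a threshold of 2, only entities that are a single edit away from a known entry are corrected. A transposition such as "indai" counts as two edits under plain Levenshtein distance, so it is left unchanged.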
5. Retrieve Data from Data Engine
The corrected entities are mapped to entries in the database to fetch data in the form of a table and/or bar plot(s). The results are then sent back to Slack and displayed to the user.
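One way to map extracted entities to database rows is to turn each entity into an equality filter, as in the sketch below. It uses an in-memory SQLite database purely for illustration; the `sales` table, its columns, the sample rows, and the `fetch_sales` function are all hypothetical, and the project's actual Data Engine and schema may differ.

```python
import sqlite3

# Hypothetical columns that an extracted entity may filter on.
FILTER_COLUMNS = ("product", "country", "quarter", "year")


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a few made-up sales rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales ("
        "product TEXT, country TEXT, quarter TEXT, year INTEGER, sales REAL)"
    )
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?, ?, ?, ?)",
        [
            ("drugx", "india", "q1", 2018, 120.0),
            ("drugx", "india", "q2", 2018, 135.5),
            ("drugx", "japan", "q1", 2018, 80.0),
            ("drugy", "india", "q1", 2018, 60.0),
        ],
    )
    conn.commit()
    return conn


def fetch_sales(conn: sqlite3.Connection, entities: dict) -> list:
    """Return the rows that match every recognized entity."""
    # Keep only known columns so names are never taken from user input.
    filters = {k: v for k, v in entities.items() if k in FILTER_COLUMNS}
    query = "SELECT product, country, quarter, year, sales FROM sales"
    if filters:
        query += " WHERE " + " AND ".join(f"{col} = ?" for col in filters)
    query += " ORDER BY year, quarter, product, country"
    # Values are passed as bound parameters, not formatted into the SQL.
    return conn.execute(query, tuple(filters.values())).fetchall()


conn = build_demo_db()
for row in fetch_sales(conn, {"product": "drugx", "country": "india"}):
    print(row)
```

The returned rows can then be formatted as a table or passed to a plotting library before being posted back to Slack.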
Results
Fig. 3 and Fig. 4 show the output for some of the Business Type and Technical Type questions, respectively.
Learnings
1. Technical Learnings
- Select important features and engineer new features from existing datasets.
- Custom NER for given data.
- Text data manipulation and processing techniques.
- Integration of chatbot with a messaging service like Slack.
2. Other Learnings
- Function effectively as a member of an organization.
- Write Minutes of Meeting (MoM).
- Write Status Update e-mail at the end of the day.
- Develop a timeline for a project.
Conclusion and Future Work
The goal of this work was to build a QA system in the medical domain that provides real-time responses to the common business queries of any employee in a healthcare company, removing unwanted dependencies on the analytics teams. The QA system uses NLP techniques such as NER and text pre-processing to understand and answer complex user queries.
My future goal is to support Boolean questions, which expect a yes/no answer, and other types of factual questions. Another goal is to develop augmented intelligence that identifies the type of question and offers further insights related to it.
References
- Terol, Rafael M., Patricio Martínez-Barco, and Manuel Palomar. “A knowledge based method for the medical question answering problem.” Computers in biology and medicine 37, no. 10 (2007): 1511–1521.
- Mansouri, Alireza, Lilly Suriani Affendey, and Ali Mamat. “Named entity recognition approaches.” International Journal of Computer Science and Network Security 8, no. 2 (2008): 339–344.
- https://pypi.org/project/slacker/
- https://spacy.io/usage/training#ner
- Ristad, Eric Sven, and Peter N. Yianilos. “Learning string-edit distance.” IEEE Transactions on Pattern Analysis and Machine Intelligence 20, no. 5 (1998): 522–532.
I would like to thank D Cube Analytics for this wonderful opportunity!