search_analysis.tools package
Module contents
- class search_analysis.tools.ComparisonTool(host, qry_rel_dict, eval_obj_1=None, eval_obj_2=None, fields=['text', 'title'], index_1=None, index_2=None, name_1='approach_1', name_2='approach_2', size=20, k=20)
Bases: object
- calculate_difference(condition='fscore', dumps=False)
Calculates the difference per query for the given condition.
- Parameters
condition – string; “fscore”, “precision” or “recall”
dumps – True or False; if True it returns json.dumps, if False it saves the result to an object variable
- Returns
json with value differences
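The per-query difference can be pictured with a small standalone sketch. It does not call the package: `metric_difference` is a hypothetical helper, the input dictionaries are made-up scores keyed by query id, and the sign convention (approach 1 minus approach 2) is an assumption rather than something this reference states.

```python
def metric_difference(scores_1, scores_2):
    """Return scores_1[q] - scores_2[q] for every query id present in both inputs."""
    shared_queries = sorted(set(scores_1) & set(scores_2))
    # Rounding keeps floating-point noise out of the reported differences.
    return {q: round(scores_1[q] - scores_2[q], 4) for q in shared_queries}


# Made-up F-scores per query id for two approaches.
fscore_1 = {1: 0.8, 2: 0.5, 3: 0.25}
fscore_2 = {1: 0.6, 2: 0.5, 3: 0.75}

diff = metric_difference(fscore_1, fscore_2)
print(diff)
```

A positive value means the first approach scored higher for that query under the assumed convention.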
- get_disjoint_sets(distribution, highest=False)
Returns the disjoint sets of the given distribution.
- Parameters
distribution – str; distribution to return; possible arguments are ‘false_positives’ and ‘false_negatives’
highest – True or False; if True, it only returns the set with the highest count of disjoint entries
- Returns
- Ordered_results
OrderedDict; for each query, a dictionary with the disjoint lists of each approach for the given distribution
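As an illustration of what "disjoint" means here, the following sketch computes, per query, the documents that appear in one approach's list but not in the other's. It is not the package's implementation: `disjoint_sets` is a hypothetical helper, and the input shape (a dictionary mapping query ids to lists of document ids) is an assumption.

```python
from collections import OrderedDict


def disjoint_sets(docs_1, docs_2, name_1="approach_1", name_2="approach_2"):
    """For each query, list the doc ids found by only one of the two approaches."""
    result = OrderedDict()
    for query_id in sorted(set(docs_1) | set(docs_2)):
        a = set(docs_1.get(query_id, []))
        b = set(docs_2.get(query_id, []))
        result[query_id] = {name_1: sorted(a - b), name_2: sorted(b - a)}
    return result


# Made-up false positives per query id for two approaches.
fp_1 = {1: [10, 11, 12], 2: [20]}
fp_2 = {1: [11, 13], 2: [20]}

disjoint = disjoint_sets(fp_1, fp_2)
print(disjoint[1])
```

Query 2 yields two empty lists because both approaches returned the same false positive.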
- get_specific_comparison(query_id, doc_id, fields=['text', 'title'])
Gets position, highlights and scores for a specific query and a specific document in comparison.
- Parameters
query_id – id of the query that should be looked at
doc_id – int; id of the document that should be looked at
fields – list; fields that should be searched on
- Returns
- json.dumps(comp_dict)
dict dumped as json, filled with the comparison for the given query and doc id
- visualize_condition(queries=None, eval_objs=None, conditions=['precision', 'recall', 'fscore'], download=False, path_to_file='./save_vis_condition.svg')
Visualizes conditions in comparison for given queries and given approaches.
- Parameters
queries – int, list or None; if None, it searches with all queries
eval_objs – list of EvaluationObjects; if None, it uses the ones already set in the ComparisonTool object
conditions – list; conditions that should be plotted; by default precision, recall and F1-score are used
download – True or False; if True, saves the plot as svg; False by default, in which case the visualization is not saved
path_to_file – string; path and filename the visualization should be saved to, e.g. ‘./myfolder/save_this.svg’
- Prints
visualization via matplotlib as plt.show()
- visualize_distributions(queries=None, eval_objs=None, distributions=['true_positives', 'false_positives', 'false_negatives'], download=False, path_to_file='./save_vis_distributions.svg')
Visualizes distributions in comparison for given queries and given approaches.
- Parameters
queries – int, list or None; if None, it searches with all queries
eval_objs – list of EvaluationObjects; if None, it uses the ones already set in the ComparisonTool object
distributions – list; distributions that should be plotted; by default true positives, false positives and false negatives are used
download – True or False; if True, saves the plot as svg; False by default, in which case the visualization is not saved
path_to_file – string; path and filename the visualization should be saved to, e.g. ‘./myfolder/save_this.svg’
- Prints
visualization via matplotlib as plt.show()
- visualize_explanation(query_id, doc_id, fields=['text', 'title'], eval_objs=None, download=False, path_to_file='./save_vis_explaination.svg')
Visualizes in comparison which words were scored better by each approach for a specific query and a specific document.
- Parameters
query_id – int; id of the query that should be explained
doc_id – int; id of the document that should be explained
fields – list; fields that should be searched; by default ‘text’ and ‘title’ are searched
eval_objs – list of EvaluationObjects; if None, it uses the ones already set in the ComparisonTool object
download – True or False; if True, saves the plot as svg; False by default, in which case the visualization is not saved
path_to_file – string; path and filename the visualization should be saved to, e.g. ‘./myfolder/save_this.svg’
- Prints
visualization via matplotlib as plt.show()
- visualize_explanation_csv(query_id, doc_id, path_to_save_to, fields=['text', 'title'], decimal_separator=',', eval_objs=None)
Saves the explanation table to a csv file.
- Parameters
query_id – int; id of the query that should be explained
doc_id – int; id of the document that should be explained
path_to_save_to – string; path and filename the table should be saved to, e.g. ‘./myfolder/save_that.csv’
fields – list; fields that should be searched; by default ‘text’ and ‘title’ are searched
decimal_separator – string; decimal separator to use; a comma by default, but for English-language tools you might prefer a dot
eval_objs – list or None; exactly two EvaluationObjects; if None, it uses the ones from the ComparisonTool
- Returns
csv file that can be fed to a program that creates graphs, e.g. Google Sheets or Microsoft Excel
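The effect of a decimal separator on a csv export can be shown with a short sketch using only the standard library. It is not how the package writes its file: `write_scores_csv`, the column names and the rule of switching the column delimiter to a semicolon when the decimal separator is a comma are all assumptions made for illustration.

```python
import csv
import io


def write_scores_csv(rows, out, decimal_separator=","):
    """Write (term, score_1, score_2) rows, formatting floats with the given separator."""
    # Assumption: use ';' between columns when ',' is needed inside the numbers.
    delimiter = ";" if decimal_separator == "," else ","
    writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
    writer.writerow(["term", "approach_1", "approach_2"])
    for term, score_1, score_2 in rows:
        writer.writerow([
            term,
            str(score_1).replace(".", decimal_separator),
            str(score_2).replace(".", decimal_separator),
        ])


# Made-up per-term scores; an in-memory buffer stands in for a file on disk.
buffer = io.StringIO()
write_scores_csv([("search", 1.25, 0.5), ("engine", 0.75, 2.0)], buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Spreadsheet programs set to a locale with decimal commas typically expect this combination, while a dot separator suits English-language settings.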
- class search_analysis.tools.EvaluationObject(host, query_rel_dict, index, name, verified_certificates=False)
Bases: object
- count_distribution(distribution, distribution_json, dumps=False, k=20)
Counts the given distribution per query, counts the relevant documents and calculates percentages relative to the relevant documents.
- Parameters
distribution – string; ‘true_positives’, ‘false_positives’ or ‘false_negatives’
distribution_json – json; json with all the distributions needed, e.g. EvaluationObject.true_positives
dumps – True or False; if True it returns json.dumps, if False it returns json
k – int; size of k top search results
- Returns
- sorted_counts
json counted distribution per query, as a sum and as a percentage
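A rough sketch of this kind of counting follows. It is independent of the package: `count_per_query` is a hypothetical helper, and both the input shape and the output keys are assumptions chosen for readability, not the structure the method actually returns.

```python
def count_per_query(distribution, relevant_counts):
    """Count doc ids per query and relate the count to the number of relevant documents."""
    per_query = {}
    for query_id, doc_ids in distribution.items():
        relevant = relevant_counts.get(query_id, 0)
        count = len(doc_ids)
        per_query[query_id] = {
            "count": count,
            "relevant_documents": relevant,
            # Guard against queries without any relevant documents.
            "percentage": round(100 * count / relevant, 2) if relevant else 0.0,
        }
    total = sum(entry["count"] for entry in per_query.values())
    return {"per_query": per_query, "total": total}


# Made-up true positives per query id and number of relevant documents per query.
true_positives = {1: [3, 5, 8], 2: [4]}
relevant_counts = {1: 4, 2: 5}

summary = count_per_query(true_positives, relevant_counts)
print(summary)
```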
- explain_query(query_id, doc_id, fields=['text', 'title'], dumps=True)
Returns an Elasticsearch explanation for the given query and document.
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-explain.html
- Parameters
query_id – int; id of the query that should be explained
doc_id – int; id of the document that should be explained
fields – list of str; fields that should be searched on
dumps – True or False; True by default; if False, it won’t convert the dict to json
- Returns
json or dict explaining query and document match
- get_false_negatives(searched_queries=None, fields=['text', 'title'], size=20, k=20, dumps=False)
Calculates false negatives from given search queries.
- Parameters
searched_queries – int, list or None; query ids; if None, it searches with all queries
fields – list of str; fields that should be searched on
size – int; search size
k – int; top results that should be returned from Elasticsearch
dumps – True or False; if True it returns json.dumps, if False it returns json
- Returns
- False negatives
json
- get_false_positives(searched_queries=None, fields=['text', 'title'], size=20, k=20, dumps=False)
Calculates false positives from given search queries.
- Parameters
searched_queries – int, list or None; query ids; if None, it searches with all queries
fields – list of str; fields that should be searched on
size – int; search size
k – int; top results that should be returned from Elasticsearch
dumps – True or False; if True it returns json.dumps, if False it returns json
- Returns
- False positives
json
- get_fscore(searched_queries=None, fields=['text', 'title'], size=20, k=20, dumps=False, factor=1)
Calculates the F-score for every search query given.
- Parameters
searched_queries – int, list or None; searched queries; if None, it searches with all queries
fields – list of str; fields that should be searched on
size – int; search size
k – int; top results that should be returned from Elasticsearch
dumps – True or False; if True it returns json.dumps, if False it saves the result to an object variable
factor – int; can be used to weight the F-score; default is 1
- Returns
json with F-score values
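For orientation, the standard weighted F-score can be written in a few lines. This sketch assumes that `factor` plays the role of β in the usual F-beta formula, F = (1 + β²)·P·R / (β²·P + R); this reference does not confirm that the package uses exactly this form, and `f_score` is a hypothetical helper.

```python
def f_score(precision, recall, factor=1):
    """Standard F-beta score; factor is assumed to be the beta weight."""
    beta_squared = factor ** 2
    denominator = beta_squared * precision + recall
    if denominator == 0:
        # Both precision and recall are zero, so the score is defined as zero here.
        return 0.0
    return (1 + beta_squared) * precision * recall / denominator


balanced = f_score(0.5, 0.5)             # beta = 1 weights precision and recall equally
recall_heavy = f_score(0.5, 1.0, factor=2)  # beta = 2 weights recall more strongly
print(balanced, recall_heavy)
```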
- get_precision(searched_queries=None, fields=['text', 'title'], size=20, k=20, dumps=False)
Calculates precision for every search query given.
- Parameters
searched_queries – int, list or None; searched queries; if None, it searches with all queries
fields – list of str; fields that should be searched on
size – int; search size
k – int; top results that should be returned from Elasticsearch
dumps – True or False; if True it returns json.dumps, if False it saves the result to an object variable
- Returns
json with Precision values
- get_recall(searched_queries=None, fields=['text', 'title'], size=20, k=20, dumps=False)
Calculates recall for every search query given.
- Parameters
searched_queries – int, list or None; searched queries; if None, it searches with all queries
fields – list of str; fields that should be searched on
size – int; search size
k – int; top results that should be returned from Elasticsearch
dumps – True or False; if True it returns json.dumps, if False it saves the result to an object variable
- Returns
json with Recall values
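Precision and recall over the top k results can be sketched as follows, using the textbook definitions. The helper `precision_recall` and its inputs (a ranked list of retrieved document ids and a set of relevant ids) are assumptions for illustration; the package's handling of `size` and `k` against Elasticsearch may differ.

```python
def precision_recall(retrieved, relevant, k=20):
    """Textbook precision and recall computed over the top-k retrieved doc ids."""
    top_k = retrieved[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Made-up ranking and relevance judgments for a single query.
retrieved = [7, 2, 9, 4]
relevant = {2, 4, 5}

precision_at_4, recall_at_4 = precision_recall(retrieved, relevant, k=4)
precision_at_2, recall_at_2 = precision_recall(retrieved, relevant, k=2)
print(precision_at_4, recall_at_4)
```

Lowering k can only keep or reduce the number of hits, so recall never increases as k shrinks.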
- get_true_positives(searched_queries=None, fields=['text', 'title'], size=20, k=20, dumps=False)
Calculates true positives from given search queries.
- Parameters
searched_queries – int, list or None; query ids; if None, it searches with all queries
fields – list of str; fields that should be searched on
size – int; search size
k – int; top results that should be returned from Elasticsearch
dumps – True or False; if True it returns json.dumps, if False it returns json
- Returns
- True positives
json
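The three distributions used throughout this class (true positives, false positives and false negatives) can be illustrated with a standalone sketch. `classify_results` is a hypothetical helper that follows the usual definitions over the top k results; the json structure the package returns is not shown in this reference and is not reproduced here.

```python
def classify_results(retrieved, relevant, k=20):
    """Split doc ids into true positives, false positives and false negatives."""
    top_k = retrieved[:k]
    relevant = set(relevant)
    return {
        # Retrieved and relevant.
        "true_positives": [d for d in top_k if d in relevant],
        # Retrieved but not relevant.
        "false_positives": [d for d in top_k if d not in relevant],
        # Relevant but missing from the top-k results.
        "false_negatives": sorted(relevant - set(top_k)),
    }


# Made-up ranking and relevance judgments for a single query.
distributions = classify_results([7, 2, 9, 4], {2, 4, 5}, k=4)
print(distributions)
```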