search_analysis.tools package

Module contents

class search_analysis.tools.ComparisonTool(host, qry_rel_dict, eval_obj_1=None, eval_obj_2=None, fields=['text', 'title'], index_1=None, index_2=None, name_1='approach_1', name_2='approach_2', size=20, k=20)

Bases: object

calculate_difference(condition='fscore', dumps=False)

Calculates the difference per query for the given condition.

Parameters

condition – string “fscore”, “precision” or “recall”
dumps – True or False if True it returns json.dumps, if False it saves to object variable

Returns

json with value differences
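As an illustration of a per-query difference, the following minimal sketch subtracts the scores of one approach from those of the other. It is not the ComparisonTool implementation: the helper name metric_difference, the plain dict inputs keyed by query id and the sign convention (approach 1 minus approach 2) are assumptions made for this example.

```python
def metric_difference(scores_1, scores_2):
    """Return scores_1 - scores_2 for every query id present in both dicts."""
    shared_ids = sorted(set(scores_1) & set(scores_2))
    # Rounding only keeps floating point noise out of the example output.
    return {qid: round(scores_1[qid] - scores_2[qid], 4) for qid in shared_ids}


# Hypothetical F-scores per query id for two approaches.
fscore_approach_1 = {1: 0.80, 2: 0.50, 3: 0.25}
fscore_approach_2 = {1: 0.60, 2: 0.50, 3: 0.75}

diff = metric_difference(fscore_approach_1, fscore_approach_2)
print(diff)
```

A positive value means the first approach scored higher for that query, a negative value means the second did.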

get_disjoint_sets(distribution, highest=False)

Returns the disjoint sets of the given distribution.

Parameters

distribution – str distribution to return; possible arguments are ‘false_positives’ and ‘false_negatives’
highest – True or False if True it only returns the set with the highest count of disjoints

Returns

ordered_results: OrderedDict with the disjoint lists of each approach, per query, for the given distribution
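The idea of disjoint sets can be sketched with plain set operations: for each query, keep the documents that only one of the two approaches produced. This is a sketch under assumptions, not the library's code; the function names, the "count" key and the input dicts of document ids are hypothetical.

```python
from collections import OrderedDict


def disjoint_sets(docs_1, docs_2, name_1="approach_1", name_2="approach_2"):
    """Per query, list the documents found by only one of the two approaches."""
    result = OrderedDict()
    for qid in sorted(set(docs_1) | set(docs_2)):
        first = set(docs_1.get(qid, []))
        second = set(docs_2.get(qid, []))
        result[qid] = {
            name_1: sorted(first - second),   # only in approach 1
            name_2: sorted(second - first),   # only in approach 2
            "count": len(first ^ second),     # size of the symmetric difference
        }
    return result


def highest_only(disjoint):
    """Keep only the query with the largest number of disjoint documents."""
    qid = max(disjoint, key=lambda q: disjoint[q]["count"])
    return {qid: disjoint[qid]}


# Hypothetical false positives per query id for two approaches.
fp_1 = {1: [10, 11, 12], 2: [20]}
fp_2 = {1: [11, 13], 2: [20, 21]}

all_sets = disjoint_sets(fp_1, fp_2)
top_set = highest_only(all_sets)
print(top_set)
```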

get_specific_comparison(query_id, doc_id, fields=['text', 'title'])

Gets position, highlights and scores for a specific query and a specific document in comparison.

Parameters

query_id – int id of the query that should be looked at
doc_id – int id of the document that should be looked at
fields – list list of fields that should be searched on

Returns

json.dumps(comp_dict): dict dumped as json filled with comparison for given query and doc id

visualize_condition(queries=None, eval_objs=None, conditions=['precision', 'recall', 'fscore'], download=False, path_to_file='./save_vis_condition.svg')

Visualizes conditions in comparison for given queries and given approaches.

Parameters

queries – int or list or None if None it searches with all queries
eval_objs – list EvaluationObjs; if None it uses the ones already implemented in the ComparisonTool object
conditions – list conditions that should be printed; by default precision, recall and f1-score are used
download – True or False saves the plot as svg; by default False which leads to not saving the visualization
path_to_file – string path and filename the visualization should be saved to, e.g. ‘./myfolder/save_this.svg’

Prints

visualization via matplotlib, shown with plt.show()

visualize_distributions(queries=None, eval_objs=None, distributions=['true_positives', 'false_positives', 'false_negatives'], download=False, path_to_file='./save_vis_distributions.svg')

Visualizes distributions in comparison for given queries and given approaches.

Parameters

queries – int or list or None if None it searches with all queries
eval_objs – list EvaluationObjs; if None it uses the ones already implemented in the ComparisonTool object
distributions – list distributions that should be printed; by default tp, fp and fn are used
download – True or False saves the plot as svg; by default False which leads to not saving the visualization
path_to_file – string path and filename the visualization should be saved to, e.g. ‘./myfolder/save_this.svg’

Prints

visualization via matplotlib, shown with plt.show()

visualize_explanation(query_id, doc_id, fields=['text', 'title'], eval_objs=None, download=False, path_to_file='./save_vis_explaination.svg')

Visualizes, in comparison, which words were scored better by each approach for a specific query and a specific document.

Parameters

query_id – int id of query that should be explained
doc_id – int id of document that should be explained
fields – list fields that should be searched, by default ‘text’ and ‘title’ are searched
eval_objs – list EvaluationObjs; if None it uses the ones already implemented in the ComparisonTool object
download – True or False saves the plot as svg; by default False which leads to not saving the visualization
path_to_file – string path and filename the visualization should be saved to, e.g. ‘./myfolder/save_this.svg’

Prints

visualization via matplotlib, shown with plt.show()

visualize_explanation_csv(query_id, doc_id, path_to_save_to, fields=['text', 'title'], decimal_separator=',', eval_objs=None)

Saves the explanation table to a CSV file.

Parameters

query_id – int query id of query that should be explained
doc_id – int id of document that should be explained
path_to_save_to – string path and filename the visualization should be saved to, e.g. ‘./myfolder/save_that.csv’
fields – list fields that should be searched, by default ‘text’ and ‘title’ are searched
decimal_separator – string choose a decimal separator; by default it’s a comma, but for English-language locales you might prefer a dot
eval_objs – list or None exactly two EvaluationObjs; if None it uses the ones from the ComparisonTool

Returns

CSV file that can be fed to a program for creating graphs, e.g. Google Sheets or Microsoft Excel
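The effect of a decimal separator on exported scores can be sketched with the standard csv module. This is only an illustration, not the library's export code: the column names, the semicolon field delimiter (chosen so that a decimal comma does not need quoting) and the sample rows are assumptions.

```python
import csv
import io


def format_decimal(value, decimal_separator=","):
    """Render a float with four decimals and the chosen decimal separator."""
    return f"{value:.4f}".replace(".", decimal_separator)


def explanation_rows_to_csv(rows, decimal_separator=","):
    """Serialize (term, score_1, score_2) rows to CSV text."""
    buffer = io.StringIO()
    # Assumption: ';' separates fields, so ',' stays free for decimals.
    writer = csv.writer(buffer, delimiter=";", lineterminator="\n")
    writer.writerow(["term", "approach_1", "approach_2"])
    for term, score_1, score_2 in rows:
        writer.writerow([
            term,
            format_decimal(score_1, decimal_separator),
            format_decimal(score_2, decimal_separator),
        ])
    return buffer.getvalue()


# Hypothetical per-term scores for two approaches.
sample_rows = [("search", 1.25, 0.5), ("engine", 0.0, 2.125)]
csv_text = explanation_rows_to_csv(sample_rows)
print(csv_text)
```

Passing decimal_separator="." to the sketch yields values such as 1.2500, which spreadsheet programs with an English locale usually parse as numbers.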

class search_analysis.tools.EvaluationObject(host, query_rel_dict, index, name, verified_certificates=False)

Bases: object

count_distribution(distribution, distribution_json, dumps=False, k=20)

Counts the given distribution and the relevant documents per query, and calculates percentages relative to the relevant documents.

Parameters

distribution – string ‘true_positives’, ‘false_positives’ or ‘false_negatives’
distribution_json – json json with all the distributions needed; e.g. EvaluationObject.true_positives
dumps – True or False if True it returns json.dumps, if False it returns json
k – int size of k top search results

Returns

sorted_counts: json counted distribution per query, as a sum and as a percentage
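Counting a distribution per query and relating it to the relevant documents could look roughly like the sketch below. The function name count_per_query, the shape of the inputs and the keys of the returned dict are assumptions for illustration; the actual structure returned by count_distribution may differ.

```python
def count_per_query(distribution, relevant_counts):
    """Count documents per query and relate the count to the relevant documents."""
    counts = {}
    for query_id, doc_ids in distribution.items():
        relevant = relevant_counts.get(query_id, 0)
        count = len(doc_ids)
        # Guard against queries without any relevant documents.
        percentage = round(100.0 * count / relevant, 2) if relevant else 0.0
        counts[query_id] = {
            "count": count,
            "relevant": relevant,
            "percentage": percentage,
        }
    # Sort queries by count, highest first (dicts keep insertion order).
    sorted_counts = dict(
        sorted(counts.items(), key=lambda item: item[1]["count"], reverse=True)
    )
    total = sum(entry["count"] for entry in counts.values())
    return sorted_counts, total


# Hypothetical true positives and relevant-document counts per query id.
true_positives = {1: [3], 2: [4, 8], 3: []}
relevant_per_query = {1: 1, 2: 4, 3: 5}

sorted_counts, total = count_per_query(true_positives, relevant_per_query)
print(sorted_counts, total)
```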

explain_query(query_id, doc_id, fields=['text', 'title'], dumps=True)

Returns an Elasticsearch explanation for given query and document.

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-explain.html

Parameters

query_id – int id of query that should be explained
doc_id – int id of document that should be explained
fields – list of str fields that should be searched on
dumps – True or False True by default, if False it won’t convert dict to json

Returns

json or dict explaining query and document match
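An Elasticsearch explanation is a nested tree in which every node carries a value, a description and a list of details. The sketch below flattens such a tree into rows for inspection. The helper name flatten_explanation and the sample explanation (including its description strings) are made up for illustration; only the value/description/details layout follows the Elasticsearch explain response.

```python
def flatten_explanation(node, depth=0):
    """Walk an explanation tree depth-first and return (depth, value, description) rows."""
    rows = [(depth, node["value"], node["description"])]
    for child in node.get("details", []):
        rows.extend(flatten_explanation(child, depth + 1))
    return rows


# Hypothetical explanation tree, shaped like the "explanation" part of a response.
explanation = {
    "value": 1.5,
    "description": "sum of:",
    "details": [
        {"value": 1.0, "description": "weight(text:search)", "details": []},
        {"value": 0.5, "description": "weight(title:search)", "details": []},
    ],
}

rows = flatten_explanation(explanation)
for depth, value, description in rows:
    # Indent each row by its depth in the tree.
    print("  " * depth + f"{value} {description}")
```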

get_false_negatives(searched_queries=None, fields=['text', 'title'], size=20, k=20, dumps=False)

Calculates false negatives from given search queries.

Parameters

searched_queries – int or list or None query ids; if None it searches with all queries
fields – list of str fields that should be searched on
size – int search size
k – int top results that should be returned from Elasticsearch
dumps – True or False if True it returns json.dumps, if False it returns json

Returns

False negatives: json

get_false_positives(searched_queries=None, fields=['text', 'title'], size=20, k=20, dumps=False)

Calculates false positives from given search queries.

Parameters

searched_queries – int or list or None query ids; if None it searches with all queries
fields – list of str fields that should be searched on
size – int search size
k – int top results that should be returned from Elasticsearch
dumps – True or False if True it returns json.dumps, if False it returns json

Returns

False positives: json

get_fscore(searched_queries=None, fields=['text', 'title'], size=20, k=20, dumps=False, factor=1)

Calculates f-score for every search query given.

Parameters

searched_queries – int or list or None searched queries; if None it searches with all queries
fields – list of str fields that should be searched on
size – int search size
k – int top results that should be returned from Elasticsearch
dumps – True or False if True it returns json.dumps, if False it saves to object variable
factor – int can be used to weight the F score, default is 1

Returns

json with F-score values
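A weighted F-score is commonly defined as F = (1 + β²) · P · R / (β² · P + R), where P is precision and R is recall. The sketch below assumes that factor plays the role of β; the source does not state this, so treat it as an assumption rather than the library's definition.

```python
def f_score(precision, recall, factor=1):
    """Weighted harmonic mean of precision and recall (F-beta with beta = factor)."""
    beta_squared = factor ** 2
    denominator = beta_squared * precision + recall
    if denominator == 0:
        # Both precision and recall are zero, so the score is zero as well.
        return 0.0
    return (1 + beta_squared) * precision * recall / denominator


f1 = f_score(0.5, 1.0)             # balanced F1
f2 = f_score(0.5, 1.0, factor=2)   # weights recall more heavily
print(round(f1, 4), round(f2, 4))
```

Under this definition a factor above 1 favours recall and a factor below 1 favours precision.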

get_precision(searched_queries=None, fields=['text', 'title'], size=20, k=20, dumps=False)

Calculates precision for every search query given.

Parameters

searched_queries – int or list or None searched queries; if None it searches with all queries
fields – list of str fields that should be searched on
size – int search size
k – int top results that should be returned from Elasticsearch
dumps – True or False if True it returns json.dumps, if False it saves to object variable

Returns

json with Precision values
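Precision for a single query is the share of returned documents that are relevant. The following sketch assumes that k cuts the ranked result list and that the denominator is the number of documents actually returned within that cutoff; both are assumptions, and the function name precision_at_k is hypothetical.

```python
def precision_at_k(retrieved, relevant, k=20):
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    relevant = set(relevant)
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    return hits / len(top_k)


# Hypothetical ranked result list and relevance judgements for one query.
ranked_results = [5, 9, 2, 7]
relevant_docs = {5, 7, 11}

precision = precision_at_k(ranked_results, relevant_docs, k=4)
print(precision)
```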

get_recall(searched_queries=None, fields=['text', 'title'], size=20, k=20, dumps=False)

Calculates recall for every search query given.

Parameters

searched_queries – int or list or None searched queries; if None it searches with all queries
fields – list of str fields that should be searched on
size – int search size
k – int top results that should be returned from Elasticsearch
dumps – True or False if True it returns json.dumps, if False it saves to object variable

Returns

json with Recall values
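Recall for a single query is the share of relevant documents that were returned. As a sketch, assuming again that k cuts the ranked result list (the name recall_at_k and the sample data are hypothetical):

```python
def recall_at_k(retrieved, relevant, k=20):
    """Fraction of the relevant documents found within the top-k results."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


# Hypothetical ranked result list and relevance judgements for one query.
ranked = [5, 9, 2, 7]
judged_relevant = {5, 7, 11}

recall = recall_at_k(ranked, judged_relevant, k=4)
print(round(recall, 4))
```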

get_true_positives(searched_queries=None, fields=['text', 'title'], size=20, k=20, dumps=False)

Calculates true positives from given search queries.

Parameters

searched_queries – int or list or None query ids; if None it searches with all queries
fields – list of str fields that should be searched on
size – int search size
k – int top results that should be returned from Elasticsearch
dumps – True or False if True it returns json.dumps, if False it returns json

Returns

True positives: json
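The three distributions (true positives, false positives and false negatives) can be derived from a ranked result list and a set of relevant documents. The sketch below shows the usual definitions for one query. It does not query Elasticsearch and is not the EvaluationObject implementation; the function name classify_results and the treatment of k as a cutoff are assumptions.

```python
def classify_results(retrieved, relevant, k=20):
    """Split one query's results into true positives, false positives and false negatives."""
    top_k = retrieved[:k]
    relevant = set(relevant)
    return {
        # Returned and relevant.
        "true_positives": [doc_id for doc_id in top_k if doc_id in relevant],
        # Returned but not relevant.
        "false_positives": [doc_id for doc_id in top_k if doc_id not in relevant],
        # Relevant but missing from the top-k results.
        "false_negatives": sorted(relevant - set(top_k)),
    }


# Hypothetical ranked result list and relevance judgements for one query.
distributions = classify_results([5, 9, 2, 7], {5, 7, 11}, k=4)
print(distributions)
```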