Usage

To use Search Analysis in a project:

import search_analysis

Evaluation Object

class search_analysis.EvaluationObject(host, query_rel_dict, index, name, verified_certificates=False)
count_distribution(distribution, distribution_json, dumps=False, k=20)

Counts the given distribution and the relevant documents per query, and calculates the distribution as a percentage of the relevant documents.

Parameters
  • distribution – string ‘true_positives’, ‘false_positives’ or ‘false_negatives’

  • distribution_json – json all the distributions needed, e.g. EvaluationObject.true_positives

  • dumps – True or False if True it returns json.dumps, if False it returns json

  • k – int size of k top search results

Returns

sorted_counts

json counted distribution per query, as a sum and as a percentage
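
The counting step described above can be sketched in plain Python. This is only an illustration, not the library's implementation: the function name count_per_query, the input shape (query id mapped to a list of document ids) and the relevant_counts argument are assumptions made for this example.

```python
def count_per_query(distribution, relevant_counts):
    """Count documents per query and express the count as a percentage.

    distribution:    hypothetical shape {query_id: [doc_id, ...]}
    relevant_counts: hypothetical shape {query_id: number of relevant docs}
    """
    counts = {}
    for query_id, doc_ids in distribution.items():
        total_relevant = relevant_counts.get(query_id, 0)
        count = len(doc_ids)
        # Guard against queries that have no relevant documents at all.
        percentage = 100 * count / total_relevant if total_relevant else 0.0
        counts[query_id] = {"count": count, "percentage": percentage}
    # Sort queries by count, highest first.
    return dict(sorted(counts.items(), key=lambda item: item[1]["count"], reverse=True))


sorted_counts = count_per_query({1: [7], 2: [4, 5, 6]}, {1: 4, 2: 4})
```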

explain_query(query_id, doc_id, fields=['text', 'title'], dumps=True)

Returns an Elasticsearch explanation for given query and document.

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-explain.html

Parameters
  • query_id – int id of query that should be explained

  • doc_id – int id of document that should be explained

  • fields – list of str fields that should be searched on

  • dumps – True or False True by default; if False, the dict is not converted to json

Returns

json or dict explaining query and document match
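
For orientation, the Elasticsearch explain API linked above takes a document id in the URL path and a query in the request body. The sketch below only builds such a request as plain data. The helper name build_explain_request and the choice of a multi_match query are assumptions for illustration; the query that explain_query actually sends may differ.

```python
def build_explain_request(index, doc_id, query_text, fields=("text", "title")):
    """Build path and body for an Elasticsearch explain request (illustrative only)."""
    # The explain endpoint has the form /<index>/_explain/<id>.
    path = f"/{index}/_explain/{doc_id}"
    # Assumption: a multi_match query over the given fields.
    body = {"query": {"multi_match": {"query": query_text, "fields": list(fields)}}}
    return path, body


path, body = build_explain_request("my_index", 42, "search evaluation")
```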

get_false_negatives(searched_queries=None, fields=['text', 'title'], size=20, k=20, dumps=False)

Calculates false negatives from given search queries.

Parameters
  • searched_queries – int or list or None query ids; if None it searches with all queries

  • fields – list of str fields that should be searched on

  • size – int search size

  • k – int top results that should be returned from Elasticsearch

  • dumps – True or False if True it returns json.dumps, if False it returns json

Returns

False negatives

json
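
As a rough illustration of the terms, the top k results of one query can be split into true positives, false positives and false negatives by comparing them with the documents judged relevant. The function classify_results and its arguments are hypothetical and stand in for whatever the library does internally.

```python
def classify_results(retrieved_ids, relevant_ids, k=20):
    """Split the top-k results of one query into TP, FP and FN."""
    top_k = retrieved_ids[:k]
    relevant = set(relevant_ids)
    # Retrieved and relevant.
    true_positives = [doc for doc in top_k if doc in relevant]
    # Retrieved but not relevant.
    false_positives = [doc for doc in top_k if doc not in relevant]
    # Relevant but missing from the top k.
    false_negatives = sorted(relevant - set(top_k))
    return {
        "true_positives": true_positives,
        "false_positives": false_positives,
        "false_negatives": false_negatives,
    }


result = classify_results([3, 7, 9, 12], relevant_ids=[7, 12, 40], k=3)
```

With k=3, document 12 falls outside the cut-off, so it counts as a false negative even though the search returned it at rank 4.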

get_false_positives(searched_queries=None, fields=['text', 'title'], size=20, k=20, dumps=False)

Calculates false positives from given search queries.

Parameters
  • searched_queries – int or list or None query ids; if None it searches with all queries

  • fields – list of str fields that should be searched on

  • size – int search size

  • k – int top results that should be returned from Elasticsearch

  • dumps – True or False if True it returns json.dumps, if False it returns json

Returns

False positives

json

get_fscore(searched_queries=None, fields=['text', 'title'], size=20, k=20, dumps=False, factor=1)

Calculates f-score for every search query given.

Parameters
  • searched_queries – int or list or None searched queries; if None it searches with all queries

  • fields – list of str fields that should be searched on

  • size – int search size

  • k – int top results that should be returned from Elasticsearch

  • dumps – True or False if True it returns json.dumps, if False it saves to object variable

  • factor – int can be used to weight the F score, default is 1

Returns

json with F-score values
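
A weighted F-score is commonly computed with the F-beta formula. The sketch below assumes that factor plays the role of beta in that formula; the source does not state this, so treat it as an assumption rather than a description of get_fscore.

```python
def f_score(precision, recall, factor=1):
    """F-beta score, with `factor` assumed to be beta."""
    if precision == 0 and recall == 0:
        return 0.0  # avoid division by zero when nothing relevant was found
    beta_sq = factor ** 2
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


f1 = f_score(0.5, 1.0)            # balanced F1
f2 = f_score(0.5, 1.0, factor=2)  # weights recall more heavily
```

With a factor above 1 recall dominates the score, and with a factor below 1 precision does.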

get_precision(searched_queries=None, fields=['text', 'title'], size=20, k=20, dumps=False)

Calculates precision for every search query given.

Parameters
  • searched_queries – int or list or None searched queries; if None it searches with all queries

  • fields – list of str fields that should be searched on

  • size – int search size

  • k – int top results that should be returned from Elasticsearch

  • dumps – True or False if True it returns json.dumps, if False it saves to object variable

Returns

json with Precision values

get_recall(searched_queries=None, fields=['text', 'title'], size=20, k=20, dumps=False)

Calculates recall for every search query given.

Parameters
  • searched_queries – int or list or None searched queries; if None it searches with all queries

  • fields – list of str fields that should be searched on

  • size – int search size

  • k – int top results that should be returned from Elasticsearch

  • dumps – True or False if True it returns json.dumps, if False it saves to object variable

Returns

json with Recall values
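
Precision and recall at a cut-off k follow directly from the number of relevant documents among the top k results. The following is a minimal sketch with a hypothetical helper, precision_recall_at_k. It divides precision by the number of results actually returned, which is one possible convention and not necessarily the one the library uses.

```python
def precision_recall_at_k(retrieved_ids, relevant_ids, k=20):
    """Compute precision and recall for the top-k results of one query."""
    top_k = retrieved_ids[:k]
    relevant = set(relevant_ids)
    hits = sum(1 for doc in top_k if doc in relevant)
    # Share of returned documents that are relevant.
    precision = hits / len(top_k) if top_k else 0.0
    # Share of relevant documents that were returned.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


precision, recall = precision_recall_at_k([3, 7, 9, 12], [7, 12, 40], k=4)
```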

get_true_positives(searched_queries=None, fields=['text', 'title'], size=20, k=20, dumps=False)

Calculates true positives from given search queries.

Parameters
  • searched_queries – int or list or None query ids; if None it searches with all queries

  • fields – list of str fields that should be searched on

  • size – int search size

  • k – int top results that should be returned from Elasticsearch

  • dumps – True or False if True it returns json.dumps, if False it returns json

Returns

True positives

json

Comparison Tool

class search_analysis.ComparisonTool(host, qry_rel_dict, eval_obj_1=None, eval_obj_2=None, fields=['text', 'title'], index_1=None, index_2=None, name_1='approach_1', name_2='approach_2', size=20, k=20)
calculate_difference(condition='fscore', dumps=False)

Calculates the difference per query for the given condition.

Parameters
  • condition – string “fscore”, “precision” or “recall”

  • dumps – True or False if True it returns json.dumps, if False saves to object variable

Returns

json with value differences
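
Conceptually, a per-query difference subtracts the score of one approach from the score of the other for every query both approaches share. The helper below is a sketch under that assumption. Its name, the input shape (query id mapped to score) and the direction of the subtraction (first minus second) are all hypothetical.

```python
def difference_per_query(scores_1, scores_2):
    """Return scores_1 - scores_2 for every query id present in both inputs."""
    shared = sorted(set(scores_1) & set(scores_2))
    # Round to hide floating-point noise such as 0.20000000000000007.
    return {query_id: round(scores_1[query_id] - scores_2[query_id], 4) for query_id in shared}


diff = difference_per_query({1: 0.8, 2: 0.5}, {1: 0.6, 2: 0.75})
```

Under this convention a positive value means the first approach scored higher for that query.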

get_disjoint_sets(distribution, highest=False)

Returns the disjoint sets of the given distribution.

Parameters
  • distribution – str distribution to return; possible arguments are ‘false_positives’ and ‘false_negatives’

  • highest – True or False if True it only returns the set with the highest count of disjoints

Returns

Ordered_results

OrderedDict disjoint lists for each approach, in a dictionary per query, for the given distribution
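
The idea of disjoint sets can be illustrated with plain set differences: for one query, keep the documents that appear in the distribution of one approach but not in the other. This is a sketch with a hypothetical helper, disjoint_sets, and does not reproduce the structure of the library's return value.

```python
def disjoint_sets(docs_1, docs_2):
    """Return the documents unique to each approach for one query."""
    set_1, set_2 = set(docs_1), set(docs_2)
    return {
        "approach_1": sorted(set_1 - set_2),  # only in the first approach
        "approach_2": sorted(set_2 - set_1),  # only in the second approach
    }


# e.g. false positives of one query for two approaches
disjoint = disjoint_sets([3, 9, 15], [9, 21])
```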

get_specific_comparison(query_id, doc_id, fields=['text', 'title'])

Gets position, highlights and scores for a specific query and a specific document in comparison.

Parameters
  • query_id – int id of query that should be looked at

  • doc_id – int doc id that should be looked at

  • fields – list of str fields that should be searched on

Returns

json.dumps(comp_dict)

dict dumped as json filled with comparison for given query and doc id

visualize_condition(queries=None, eval_objs=None, conditions=['precision', 'recall', 'fscore'], download=False, path_to_file='./save_vis_condition.svg')

Visualizes conditions in comparison for given queries and given approaches.

Parameters
  • queries – int or list or None if None it searches with all queries

  • eval_objs – list EvaluationObjs; if None it uses the ones already implemented in the ComparisonTool object

  • conditions – list conditions that should be printed; by default precision, recall and f1-score are used

  • download – True or False saves the plot as svg; by default False which leads to not saving the visualization

  • path_to_file – string path and filename the visualization should be saved to, e.g. ‘./myfolder/save_this.svg’

Prints

visualization via matplotlib as plt.show()

visualize_distributions(queries=None, eval_objs=None, distributions=['true_positives', 'false_positives', 'false_negatives'], download=False, path_to_file='./save_vis_distributions.svg')

Visualizes distributions in comparison for given queries and given approaches.

Parameters
  • queries – int or list or None if None it searches with all queries

  • eval_objs – list EvaluationObjs; if None it uses the ones already implemented in the ComparisonTool object

  • distributions – list distributions that should be printed; by default tp, fp and fn are used

  • download – True or False saves the plot as svg; by default False which leads to not saving the visualization

  • path_to_file – string path and filename the visualization should be saved to, e.g. ‘./myfolder/save_this.svg’

Prints

visualization via matplotlib as plt.show()

visualize_explanation(query_id, doc_id, fields=['text', 'title'], eval_objs=None, download=False, path_to_file='./save_vis_explaination.svg')

Visualizes, in comparison, which words were scored better by each approach for a specific query and a specific document.

Parameters
  • query_id – int id of query that should be explained

  • doc_id – int id of document that should be explained

  • fields – list fields that should be searched, by default ‘text’ and ‘title’ are searched

  • eval_objs – list EvaluationObjs; if None it uses the ones already implemented in the ComparisonTool object

  • download – True or False saves the plot as svg; by default False which leads to not saving the visualization

  • path_to_file – string path and filename the visualization should be saved to, e.g. ‘./myfolder/save_this.svg’

Prints

visualization via matplotlib as plt.show()

visualize_explanation_csv(query_id, doc_id, path_to_save_to, fields=['text', 'title'], decimal_separator=',', eval_objs=None)

Saves explanation table to csv

Parameters
  • query_id – int query id of query that should be explained

  • doc_id – int id of document that should be explained

  • path_to_save_to – string path and filename the visualization should be saved to, e.g. ‘./myfolder/save_that.csv’

  • fields – list fields that should be searched, by default ‘text’ and ‘title’ are searched

  • decimal_separator – string decimal separator to use; by default it’s a comma, but for English locales you might prefer a dot

  • eval_objs – list or None exactly two EvaluationObjs; if None it uses the ones from the ComparisonTool

Returns

csv file that can be fed to a program that creates graphs, e.g. Google Sheets or Microsoft Excel
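
To show why the decimal separator matters, the sketch below writes a small score table with Python's csv module and swaps the decimal point for the chosen separator. The helper write_scores_csv, the column names and the semicolon delimiter are assumptions for this example and are not taken from the library.

```python
import csv
import io


def write_scores_csv(rows, handle, decimal_separator=","):
    """Write (term, score_1, score_2) rows with a configurable decimal separator."""
    def fmt(value):
        # Format with two decimals, then replace the decimal point.
        return f"{value:.2f}".replace(".", decimal_separator)

    # A semicolon delimiter keeps decimal commas from splitting the cells.
    writer = csv.writer(handle, delimiter=";", lineterminator="\n")
    writer.writerow(["term", "approach_1", "approach_2"])
    for term, score_1, score_2 in rows:
        writer.writerow([term, fmt(score_1), fmt(score_2)])


buffer = io.StringIO()
write_scores_csv([("search", 1.25, 0.9)], buffer)
output = buffer.getvalue()
```

Passing decimal_separator="." produces values such as 1.25, which spreadsheet programs with English locale settings parse as numbers.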