dora package

Submodules

dora.api module

class dora.api.DataExplorer(config=<dora.config.Config object>)

Bases: object

Main instantiated class for Dora package

Contains a property for each submodule in the package. Initialize it with a Config object.

benchmarks

SqlSource – submodule to query API benchmarks

categories

AsterixSource – submodule to query categories

customers

SqlSource – submodule to query customers

orders

SqlSource – submodule to query orders

products

SqlSource – submodule to query products

recommendations

SqlSource – submodule to query ML model output

reviews

SqlSource, SolrSource – submodule to query reviews

dora.benchmarks module

class dora.benchmarks.Benchmarks(sql_config)

Bases: dora.datasources.SqlSource

clientActivity(aggregate=False, min_date='1900-1-1', max_date=None, clientid_filter=None, sample_size=100)

Hourly aggregate API activity by client.

Parameters:
  • aggregate (bool) – individual results or aggregate all clients
  • min_date (string) – optional. date. Limits the search result timeframe.
  • max_date (string) – optional. date. Limits the search result timeframe.
  • clientid_filter (None or list) – optional. List of clientids to query.
  • sample_size (int) – optional. Percentage of the benchmarks the query will run over.
Returns:

columns (list of str): [‘clientid’,’hour’,’api_calls’]

results (list of tuple(str,int,int))

Return type:

QueryResponse
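The return shape described above (a list of column names plus a list of row tuples) can be consumed as in the following minimal sketch. The rows are invented placeholders standing in for a real response, and `rows_as_dicts` and `calls_per_client` are hypothetical helper names that are not part of dora.

```python
from collections import defaultdict

# Placeholder data in the documented shape; the values are invented for illustration.
columns = ["clientid", "hour", "api_calls"]
results = [("client-a", 9, 12), ("client-a", 10, 30), ("client-b", 9, 7)]


def rows_as_dicts(columns, results):
    """Pair each row tuple with the column names."""
    return [dict(zip(columns, row)) for row in results]


def calls_per_client(columns, results):
    """Sum api_calls over all hours for each clientid."""
    totals = defaultdict(int)
    for row in rows_as_dicts(columns, results):
        totals[row["clientid"]] += row["api_calls"]
    return dict(totals)


totals = calls_per_client(columns, results)
print(totals)
```

With a real response, the same helpers would presumably be fed the columns and results of the returned QueryResponse.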

insert(function, args, kwargs, start, end, is_cached, client_id)

Log an API call to SqlSource.

statsByFunction(min_date='1900-1-1', max_date=None, function_filter=None, sample_size=100)

Execution stats for API functions.

Stats aggregated by function_name and is_cached.

Parameters:
  • min_date (string) – optional. date. Limits the search result timeframe.
  • max_date (string) – optional. date. Limits the search result timeframe.
  • function_filter (None or list) – optional. List of function names to query.
  • sample_size (int) – optional. Percentage of the benchmarks the query will run over.
Returns:

columns (list of str): [‘function_name’,’is_cached’,’avg_runtime_seconds’,’total_runtime_seconds’,’invocations’]

results (list of tuple(str,bool,float,float,int))

Return type:

QueryResponse

dora.categories module

class dora.categories.Categories(asterix_config)

Bases: dora.datasources.AsterixSource

childrenOf(node_id)

Get the direct children categories of node_id.

Parameters:
  • node_id (int) – category node ID
Returns:

columns (list of str): [‘nodeID’, ‘level’, ‘child_node_id’]

results (list of tuple(int,int,int))

Return type:

QueryResponse

parentOf(node_id)

Get the direct parent category of node_id.

Parameters:
  • node_id (int) – category node ID
Returns:

columns (list of str): [‘nodeID’, ‘level’, ‘parent_node_id’]

results (list of tuple(int,int,int))

Return type:

QueryResponse

search(string_search='')

Case-sensitive search of all category levels.

Parameters:
  • string_search (str) – string to search for
Returns:

columns (list of str): [‘classification’, ‘nodeid’, ‘level_0’, ‘level_1’, ‘level_2’, ‘level_3’, ‘level_4’, ‘level_5’]

results (list of tuple(str,int,str,str,str,str,str,str))

Return type:

QueryResponse

dora.config module

class dora.config.AsterixConfig(d)

Bases: dora.config.ConfigType

cache_ttl

Cache time-to-live in seconds

collection

Collection to query

host

AsterixDB host API url.

e.g., http://localhost:19002

class dora.config.Config(path=None)

Bases: object

class dora.config.ConfigType(d)

Bases: object

Wrapper type for config property accessibility.

get_property(property_name)

class dora.config.SolrConfig(d)

Bases: dora.config.ConfigType

cache_ttl

Cache time-to-live in seconds

host

Solr host API url.

e.g., http://localhost:8983/solr/bookstore_pr/

class dora.config.SqlConfig(d)

Bases: dora.config.ConfigType

cache_ttl

Cache time-to-live in seconds

connection_string

SQL Connection string. Should be compatible with psycopg2.

random_seed

Integer for query sampling. Using the same seed will result in repeatable queries.
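The source does not say how dora applies the seed (sampling most likely happens inside the SQL query). The sketch below only illustrates the general principle in pure Python: the same seed and the same percentage select the same rows. `sample_rows` is a hypothetical helper, not a dora function.

```python
import random


def sample_rows(rows, sample_size=100, random_seed=None):
    """Return sample_size percent of rows, chosen with a seeded generator."""
    if not 0 <= sample_size <= 100:
        raise ValueError("sample_size is a percentage between 0 and 100")
    rng = random.Random(random_seed)  # private generator, global state is untouched
    k = round(len(rows) * sample_size / 100)
    return rng.sample(rows, k)


rows = list(range(1000))
first = sample_rows(rows, sample_size=10, random_seed=42)
second = sample_rows(rows, sample_size=10, random_seed=42)
print(first == second)
```

Changing the seed (or passing None) would generally yield a different sample of the same size.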

dora.customers module

class dora.customers.Customers(sql_config)

Bases: dora.datasources.SqlSource

clusterCustomers(feature_set=None, n_clusters=8, algorithm='auto', init='k-means++', cluster_on=['numorders', 'gender', 'totalpop', 'totalspent', 'zipcode', 'medianage', 'totalmales', 'totalfemales'], scale=False)

Clusters the customers together based on cluster_on parameter.

Parameters:
  • feature_set (QueryResponse or dictionary) – optional. Data that will be clustered. Must have keys ‘results’ and ‘columns’.
  • n_clusters (int) – optional. default=8. The number of clusters to form as well as the number of centroids to generate.
  • algorithm (string) – optional. “auto”, “full” or “elkan”, default=“auto”. K-means algorithm to use. The classical EM-style algorithm is “full”. The “elkan” variation is more efficient by using the triangle inequality, but currently doesn’t support sparse data. “auto” chooses “elkan” for dense data and “full” for sparse data.
  • init (string) – optional. {‘k-means++’, ‘random’ or an ndarray}. Method for initialization, defaults to ‘k-means++’. ‘k-means++’: selects initial cluster centers for k-means clustering in a smart way to speed up convergence. See section Notes in k_init for more details. ‘random’: chooses k observations (rows) at random from the data for the initial centroids. If an ndarray is passed, it should be of shape (n_clusters, n_features) and gives the initial centers.
  • cluster_on (list of str) – column names to use as cluster features
  • scale (bool) – scale features
Returns:

columns (list of str): [‘numorders’, ‘gender’, ‘totalpop’, ‘totalspent’, ‘zipcode’, ‘customermatchedid’,’medianage’, ‘totalmales’, ‘totalfemales’, ‘householdid’, ‘firstname’, ‘numcustomerid’,’cluster’, ‘customerids’ ]

results (list of tuple(int,str,int,int,float,int,int,float,int,str,int,int,list(int)))

numOrders is the number of times a customer has purchased a book. gender is the gender of the customer. zipcode identifies the customer’s location. TotalPop is the total population for the zipcode. MedianAge is the median age of the population for the zipcode. customermatchedid is the number of customerids that matched with the customer. TotalMales is the total number of males in the population for the zipcode. TotalFemales is the total number of females in the population for the zipcode. TotalSpent is the total amount the customer has spent on books. householdid is the customer’s household identification. firstname is the customer’s name. numCustomerid is the number of customerids per customer. cluster is the label of the cluster the customer belongs to. customerids is a list of all the customerids that correspond to that customer.

Return type:

QueryResponse

idsForCustomer(customermatchedids=[])

Find all customerids for each customermatchedid.

Parameters:
  • customermatchedids (list) – optional. List of customermatchedids to filter on.
Returns:

columns (list of str): [‘customermatchedid’, ‘customerid’]

results (list of tuple(int,int))

Return type:

QueryResponse

membersOfHousehold(householdID, sample_size=100)

For each household, find the customerid, firstname, and gender for each member.

Parameters:
  • householdID (int) – The household whose members will be returned.
  • sample_size (int) – optional. Percentage of the data the query will run over.
Returns:

columns (list of str): [‘customerid’, ‘firstname’, ‘gender’]

results (list of tuple(int,str,str))

customerid is the unique customer id. firstname is the name of the customer. gender is the gender of the customer.

Return type:

QueryResponse

productsByHousehold(householdID, min_date='1900-1-1', max_date=None, sample_size=100)

For each household, find all of the products that have been purchased.

Parameters:
  • householdID (int) – The household whose purchased products will be returned.
  • min_date (string) – optional. date. Limits the search result timeframe.
  • max_date (string) – optional. date. Limits the search result timeframe.
  • sample_size (int) – optional. Percentage of the data the query will run over.

statsByCustomer(min_date='1900-1-1', max_date=None, householdid=[], sample_size=100)

For each customer, find the number of book orders, gender, zipcode, household, first name, and total spent on books.

Parameters:
  • min_date (string) – optional. date. Limits the search result timeframe.
  • max_date (string) – optional. date. Limits the search result timeframe.
  • householdid (tuple) – optional. householdids that will be excluded from the query results.
  • sample_size (int) – optional. Percentage of the data the query will run over.
Returns:

columns (list of str): [‘numOrders’, ‘gender’, ‘zipcode’, ‘TotalPop’, ‘MedianAge’, ‘TotalMales’, ‘TotalFemales’,’TotalSpent’,’householdid’, ‘firstname’,’numCustomerid’ ]

results (list of tuple(int,str,int,int,float,int,int,float,int,str,int))

numOrders is the number of times a customer has purchased a book. gender is the gender of the customer. zipcode identifies the customer’s location. TotalPop is the total population for the zipcode. MedianAge is the median age of the population for the zipcode. TotalMales is the total number of males in the population for the zipcode. TotalFemales is the total number of females in the population for the zipcode. TotalSpent is the total amount the customer has spent on books. householdid is the customer’s household identification. firstname is the customer’s name. numCustomerid is the number of customerids per customer.

Return type:

QueryResponse

statsByHousehold(min_date='1900-1-1', max_date=None, sample_size=100)

For each household, find the total amount spent, the total number of orders, the order date of the first and last order, the time spent as a customer, and the time since the last order.

Parameters:
  • min_date (string) – optional. date. Limits the search result timeframe.
  • max_date (string) – optional. date. Limits the search result timeframe.
  • sample_size (int) – optional. Percentage of the data the query will run over.
Returns:

columns (list of str): [‘HouseholdID’, ‘TotalSpent’, ‘TotalOrders’, ‘first_order’, ‘last_order’, ‘time_as_customer’,’time_since_last_order’]

results (list of tuple(int,float,int,date,date,interval,interval))

householdid is the unique household id. TotalSpent is the amount of money spent on all orders. TotalOrders is the number of orders that have been made by that household. first_order is the time when the first order was made. last_order is the time when the last order was made. time_as_customer is the time the members of the household have been customers. time_since_last_order is the time since the last order.

Return type:

QueryResponse
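As a rough illustration of how the date and interval columns relate, the sketch below derives them from a list of orders in plain Python. Two points are assumptions, not taken from the source: time_as_customer is computed as the span between the first and last order, and time_since_last_order is measured against a `today` argument. `household_stats` and the sample orders are hypothetical.

```python
from datetime import date


def household_stats(orders, today):
    """orders: list of (householdid, orderdate, totalprice) tuples."""
    by_household = {}
    for householdid, orderdate, totalprice in orders:
        entry = by_household.setdefault(householdid, {"spent": 0.0, "dates": []})
        entry["spent"] += totalprice
        entry["dates"].append(orderdate)

    results = []
    for householdid, entry in sorted(by_household.items()):
        first_order, last_order = min(entry["dates"]), max(entry["dates"])
        results.append((
            householdid,
            entry["spent"],
            len(entry["dates"]),
            first_order,
            last_order,
            last_order - first_order,  # assumed meaning of time_as_customer
            today - last_order,        # time_since_last_order
        ))
    return results


# Invented sample orders, for illustration only.
orders = [
    (1, date(2020, 1, 1), 10.0),
    (1, date(2020, 1, 11), 5.5),
    (2, date(2020, 3, 1), 20.0),
]
stats = household_stats(orders, today=date(2020, 3, 31))
print(stats[0])
```

Subtracting two date objects yields a timedelta, which plays the role of the interval type listed in the results tuple.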

dora.datasources module

class dora.datasources.AsterixConnection(server)

Bases: object

query(statement, pretty=False, client_context_id=None)

class dora.datasources.AsterixQueryResponse(raw_response)

Bases: object

class dora.datasources.AsterixSource(asterix_config)

Bases: dora.datasources.Cacheable

class dora.datasources.Cacheable(ttl)

Bases: object
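Cacheable is documented only by its ttl argument, so its internals are not known from this page. The following is a minimal sketch of what a time-to-live cache generally looks like, assuming ttl is given in seconds as in the cache_ttl config properties. `TtlCache` and `get_or_compute` are hypothetical names, not dora's API.

```python
import time


class TtlCache:
    """Tiny in-memory cache whose entries expire ttl seconds after being stored."""

    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl
        self._clock = clock   # injectable so expiry can be exercised without sleeping
        self._entries = {}    # key -> (stored_at, value)

    def get_or_compute(self, key, compute):
        """Return (value, is_cached), recomputing when the entry is missing or stale."""
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1], True
        value = compute()
        self._entries[key] = (now, value)
        return value, False


# Usage with a fake clock: the second call is served from cache, the third has expired.
fake_now = [0.0]
cache = TtlCache(ttl=60, clock=lambda: fake_now[0])
calls = []


def run_query():
    calls.append(1)
    return "rows"


first = cache.get_or_compute("q", run_query)
fake_now[0] = 30.0
second = cache.get_or_compute("q", run_query)
fake_now[0] = 61.0
third = cache.get_or_compute("q", run_query)
print(first, second, third)
```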

class dora.datasources.QueryResponse(columns, results)

Bases: object

Main response object for API queries

to_csv(path)

Convert the response results to a CSV file with a header row, saved to path.

to_pandas()

Convert the response results to a pandas DataFrame, using all default pandas settings.

vis

QueryVis object for visualization convenience
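To make the columns/results pairing concrete, here is a small stand-in for the response object that writes a header row followed by the result rows. It is a sketch under assumptions: `QueryResponseSketch` and `write_csv` are hypothetical, and unlike the documented to_csv(path) the method takes a file object so the output can be inspected in memory.

```python
import csv
import io


class QueryResponseSketch:
    """Stand-in holding column names and row tuples, as QueryResponse is documented to."""

    def __init__(self, columns, results):
        self.columns = columns
        self.results = results

    def write_csv(self, fileobj):
        """Write a header row, then one line per result tuple."""
        writer = csv.writer(fileobj, lineterminator="\n")
        writer.writerow(self.columns)
        writer.writerows(self.results)


# Invented rows in the shape documented for idsForCustomer.
response = QueryResponseSketch(
    columns=["customermatchedid", "customerid"],
    results=[(1, 10), (1, 11)],
)
buffer = io.StringIO()
response.write_csv(buffer)
print(buffer.getvalue())
```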

class dora.datasources.SolrSource(solr_config)

Bases: dora.datasources.Cacheable

class dora.datasources.SqlSource(sql_config)

Bases: dora.datasources.Cacheable

dora.logger module

dora.logger.log(f)

Decorator function to log API function calls to local log file and to SqlSource.
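A logging decorator of this kind can be sketched as follows. This is not dora's implementation: the in-memory `call_log` list stands in for the log file and SqlSource, and the is_cached and client_id fields that Benchmarks.insert accepts are left out because the source does not say where they come from.

```python
import functools
import time

call_log = []  # stand-in sink for the log file and the benchmarks table


def log(f):
    """Record the function name, arguments, and start/end time of every call."""
    @functools.wraps(f)
    def wrapper(*args, **kwargs):
        start = time.time()
        try:
            return f(*args, **kwargs)
        finally:
            # Runs whether f returned or raised, so failed calls are logged too.
            call_log.append({
                "function": f.__name__,
                "args": args,
                "kwargs": kwargs,
                "start": start,
                "end": time.time(),
            })
    return wrapper


@log
def add(a, b=0):
    return a + b


result = add(2, b=3)
print(result, call_log[-1]["function"])
```

functools.wraps keeps the decorated function's name and docstring intact, which matters when the name itself is what gets logged.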

dora.orders module

class dora.orders.Orders(sql_config)

Bases: dora.datasources.SqlSource

statsByProduct(min_date='1900-1-1', max_date=None, sample_size=100)

Produces statistics for each product.

Parameters:
  • min_date (string) – optional. date. Limits the search result timeframe.
  • max_date (string) – optional. date. Limits the search result timeframe.
  • sample_size (int) – optional. Percentage of the orders the query will run over.
Returns:

columns (list of str): [‘productid’, ‘asin’, ‘num_orders’, ‘first_order’, ‘last_order’, ‘days_on_sale’, ‘unitprice_min’, ‘unitprice_max’, ‘unitprice_avg’, ‘numunits_min’, ‘numunits_max’, ‘numunits_avg’, ‘numunits_sum’, ‘totalprice_min’, ‘totalprice_max’, ‘totalprice_avg’, ‘totalprice_sum’]

results (list of tuple(str,int,int,float))

productid is the unique identifier for the product. asin is the asin for the product. first_order is the date of the first product being shipped. last_order is the last day an order was shipped. days_on_sale is the total number of days the product was on sale. unitprice_min is the minimum price for the product. unitprice_max is the maximum price for the product. unitprice_avg is the average price for the product. numunits_min is the minimum number of times the book was purchased in one order. numunits_max is the largest number of times the book was purchased in one order. numunits_avg is the average number of times the book was purchased in the same order. numunits_sum is the number of times the book was purchased.

Return type:

QueryResponse

statsByZipcode(min_date='1900-1-1', max_date=None, sample_size=100)

For each zipcode, determine the number of orders and the total amount of money spent.

Parameters:
  • min_date (string) – optional. date. Limits the search result timeframe.
  • max_date (string) – optional. date. Limits the search result timeframe.
  • sample_size (int) – optional. Percentage of the data the query will run over.
Returns:

columns (list of str): [‘countyname’, ‘countypop’, ‘NumofOrders’, ‘TotalSpending’]

results (list of tuple(str,int,int,float))

countyname is the name of the county the zipcode corresponds to. countypop is the population of the county. NumofOrders is the number of orders placed by customers in the zipcode. TotalSpending is the amount of money customers in the zipcode have spent.

Return type:

QueryResponse

dora.products module

class dora.products.Products(sql_config)

Bases: dora.datasources.SqlSource

byCategory(nodeid)

Retrieve products by category

Parameters:
  • nodeid (int) – category nodeid to search for
Returns:

columns (list of str): [‘productid’]

results (list of tuple(int))

Return type:

QueryResponse

clusterProducts(feature_set=None, n_clusters=8, algorithm='auto', cluster_on=['numorders', 'avgrating', 'category', 'days_on_sale', 'spring_sales', 'summer_sales', 'fall_sales', 'winter_sales'], random_state=None, asin=None, scale=False, PCA=False, n_components=8)

Clusters the books together using KMeans clustering utilizing the clusterQuery results as the features (num_orders, avgrating, category, and days_on_sale).

Parameters:
  • n_clusters (int) – optional. default=8. The number of clusters to form as well as the number of centroids to generate.
  • algorithm (string) – optional. “auto”, “full” or “elkan”, default=“auto”. K-means algorithm to use. The classical EM-style algorithm is “full”. The “elkan” variation is more efficient by using the triangle inequality, but currently doesn’t support sparse data. “auto” chooses “elkan” for dense data and “full” for sparse data.
  • random_state (int) – optional. int used to generate random numbers.
  • asin (tuple(string)) – optional. asins will be the centers of the kmeans clustering.
Returns:

columns (list of str): [‘productid’, ‘asin’, ‘y_pred’]

results (list of tuple(int, str, int))

productid is the product’s unique identifier. asin is the identification of the book. y_pred is the label of the cluster that the product belongs to.

Return type:

QueryResponse

coPurchases(asin, min_date='1900-1-1', max_date=None, sample_size=100)

For each given book, find all the books purchased in the same order as the given book and the number of times that book was purchased.

Parameters:
  • asin (list) – required. book asin ids. Determines the books that coPurchases will be searched for.
  • min_date (string) – optional. date. Limits the search result timeframe.
  • max_date (string) – optional. date. Limits the search result timeframe.
  • sample_size (int) – optional. Percentage of the data the query will run over.
Returns:

columns (list of str): [‘asin’, ‘numPurch’]

results (list of tuple(str,int))

asin is the identification of the book that was purchased in the same order as one of the input books. numPurch is the number of times the book was purchased.

Return type:

QueryResponse
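The co-purchase idea can be sketched in pure Python as below. dora runs this as a SQL query whose exact form is not shown here, so several details are assumptions: orders are modeled as a dict of order id to asin list, the input asins themselves are excluded from the output, and numPurch counts one per co-purchased order line. `co_purchases` and the sample orders are hypothetical.

```python
from collections import Counter


def co_purchases(orders, asin):
    """Count books bought in the same order as any of the given asins.

    orders: dict mapping order id -> list of asins in that order.
    asin: list of asins of interest.
    """
    wanted = set(asin)
    counts = Counter()
    for items in orders.values():
        if wanted.intersection(items):
            # Only orders containing at least one input book contribute.
            counts.update(a for a in items if a not in wanted)
    return counts.most_common()  # list of (asin, numPurch), most frequent first


# Invented orders, for illustration only.
orders = {
    1: ["A", "B", "C"],
    2: ["A", "B"],
    3: ["C", "D"],
}
pairs = co_purchases(orders, ["A"])
print(pairs)
```

Order 3 is ignored because it does not contain the input book, so D never appears in the result.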

priceDistribution(bins=5)

Produces statistics for each product.

Parameters:
  • bins (int or list of tuple(int,int)) – If bins is an int, prices are binned in steps of (max price - min price)/bins. A list of tuples can be used to create your own bin limits.
Returns:

columns (list of str): [‘count_<bin min>_to_<bin max>’…]

results (list of tuple(int, int...))

Return type:

QueryResponse
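The two forms of the bins argument can be illustrated with a short sketch. It is not dora's query: the boundary handling (half-open bins, with the last bin including its upper limit) and the number formatting in the column names are assumptions, and `price_distribution` is a hypothetical helper operating on an in-memory list of prices.

```python
def price_distribution(prices, bins=5):
    """Count prices per bin and return (columns, results) in the documented shape."""
    if isinstance(bins, int):
        low, high = min(prices), max(prices)
        step = (high - low) / bins
        limits = [(low + i * step, low + (i + 1) * step) for i in range(bins)]
        # Pin the final upper limit to the true maximum to avoid float drift.
        limits[-1] = (limits[-1][0], high)
    else:
        limits = list(bins)  # caller-supplied (bin min, bin max) tuples

    counts = [0] * len(limits)
    for price in prices:
        for i, (bin_min, bin_max) in enumerate(limits):
            is_last = i == len(limits) - 1
            # Assumed convention: [min, max) bins, last bin also includes max.
            if bin_min <= price < bin_max or (is_last and price == bin_max):
                counts[i] += 1
                break

    columns = [f"count_{bin_min:g}_to_{bin_max:g}" for bin_min, bin_max in limits]
    return columns, [tuple(counts)]


prices = [0, 5, 10, 15, 20]  # invented prices
even_columns, even_results = price_distribution(prices, bins=2)
custom_columns, custom_results = price_distribution(prices, bins=[(0, 10), (10, 100)])
print(even_columns, even_results)
```

Prices that fall outside every custom bin are simply not counted in this sketch.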

ratingsDistribution(min_date='1900-1-1', max_date=None, asin=[], sample_size=100)

For each product, determine how many 1, 2, 3, 4, and 5 star reviews the product received.

Parameters:
  • min_date (string) – optional. date. inclusive bottom limit of reviewTime
  • max_date (string) – optional. date. inclusive upper limit of reviewTime
  • asin (list) – optional. The asins of the products the rating distribution will be produced for. Defaults to returning distributions for all asins.
  • sample_size (int) – optional. Percentage of the reviews the query will run over.
Returns:

columns (list of str): [‘asin’, ‘productid’, ‘one_star_votes’, ‘two_star_votes’, ‘three_star_votes’, ‘four_star_votes’, ‘five_star_votes’]

results (list of tuple(str,int,int,int,int,int,int))

asin is the label for the book. productid is the unique identifier for the product. one_star_votes is the number of one star reviews the book received. two_star_votes is the number of two star reviews the book received. three_star_votes is the number of three star reviews the book received. four_star_votes is the number of four star reviews the book received. five_star_votes is the number of five star reviews the book received.

Return type:

QueryResponse
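The per-product star counts can be sketched as a simple tally. This is an in-memory illustration, not the SQL dora runs: the productid column is omitted, date filtering and sampling are left out, and `ratings_distribution` and the sample reviews are hypothetical.

```python
from collections import Counter, defaultdict


def ratings_distribution(reviews):
    """reviews: iterable of (asin, stars) pairs with stars between 1 and 5."""
    per_asin = defaultdict(Counter)
    for asin, stars in reviews:
        per_asin[asin][stars] += 1
    # One row per asin: (asin, one_star_votes, ..., five_star_votes).
    return [
        (asin,) + tuple(votes[star] for star in range(1, 6))
        for asin, votes in sorted(per_asin.items())
    ]


# Invented reviews, for illustration only.
reviews = [("B001", 5), ("B001", 5), ("B001", 1), ("B002", 3)]
rows = ratings_distribution(reviews)
print(rows)
```

A Counter returns 0 for star values that never occur, so every row has five vote columns.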

seasonalOrderDistribution(asin=[])

Produces statistics for each product.

Parameters:
  • asin (list of str) – asin product filter
Returns:

columns (list of str): [‘asin’, ‘productid’, ‘spring_sales’, ‘summer_sales’, ‘fall_sales’, ‘winter_sales’]

results (list of tuple(str, int, int, int, int, int))

Return type:

QueryResponse

statsByProduct(asin=[], min_date='1900-1-1', max_date=None, sample_size=100)

For each book, returns the product id, asin, the number of times it was purchased, the average star rating, the product category, and the number of days the product has been on sale.

Parameters:
  • min_date (string) – optional. date. Limits the search result timeframe.
  • max_date (string) – optional. date. Limits the search result timeframe.
  • sample_size (int) – optional. Percentage of the data the query will run over.
Returns:

columns (list of str): [‘productid’, ‘asin’, ‘num_orders’, ‘avgrating’, ‘category’, ‘days_on_sale’]

results (list of tuple(int,str,int,float,int,int))

productid is the product’s unique identifier. asin is the identification of the book. num_orders counts the number of times the book has been purchased. avgrating is the average star rating of the book based on the user reviews. category is the product category that the book belongs to. days_on_sale is the number of days the book has been on sale.

Return type:

QueryResponse

dora.recommendations module

class dora.recommendations.Recommendations(sql_config)

Bases: dora.datasources.SqlSource

insert(customerid, productid, rank, timestamp)

statsByProduct(productids=[], min_date='1900-1-1', max_date=None, sample_size=100)

dora.reviews module

class dora.reviews.Reviews(sql_config, solr_config)

Bases: dora.datasources.SqlSource, dora.datasources.SolrSource

asinByTerms(terms)

Given a list of words, find ASIN with the highest Solr score from its reviews.

Solr score is configurable by facet.method parameter. Default ‘fc’ option used. https://lucene.apache.org/solr/guide/6_6/faceting.html#Faceting-Thefacet.methodParameter

Parameters:
  • terms (list) – list of strings
Returns:

asin is a product identifier. score is the metric Solr is configured for (see above).

Return type:

tuple

termsByAsin(asin)

Given an ASIN, find the terms with the highest Solr score from its reviews.

Solr score is configurable by facet.method parameter. Default ‘fc’ option used. https://lucene.apache.org/solr/guide/6_6/faceting.html#Faceting-Thefacet.methodParameter

Parameters:
  • asin (string) – required. Product identifier. Terms are analyzed for this product’s reviews.
Returns:

term is a token as defined by Solr. score is the metric Solr is configured for (see above).

Return type:

tuple(term, score)

dora.vis module

class dora.vis.QueryVis(columns, results)

Bases: object

bar(title=None, cols=None)

line(x, cols=None, title=None)

scatter(x=None, y=None, z=None)

class dora.vis.VisExplorer

Bases: object

bar(query_response, x=None, y=None, z=None)

line(query_response, x=None, y=None, z=None)

scatter(query_response, x=None, y=None, z=None)
