dora package
Submodules
dora.api module
-
class dora.api.DataExplorer(config=<dora.config.Config object>)
Bases: object
Main class to instantiate for the Dora package.
Contains a property for each submodule in the package. Initialize it with a Config object.
-
benchmarks
SqlSource – submodule to query API benchmarks
-
categories
AsterixSource – submodule to query categories
-
customers
SqlSource – submodule to query customers
-
orders
SqlSource – submodule to query orders
-
products
SqlSource – submodule to query products
-
recommendations
SqlSource – submodule to query ML model output
-
reviews
SqlSource, SolrSource – submodule to query reviews
-
dora.benchmarks module
-
class dora.benchmarks.Benchmarks(sql_config)
Bases: dora.datasources.SqlSource
-
clientActivity(aggregate=False, min_date='1900-1-1', max_date=None, clientid_filter=None, sample_size=100)
Hourly aggregate API activity by client.
Parameters:
- aggregate (bool) – whether to aggregate all clients or return individual results per client
- min_date (string) – optional. Start date limiting the search result timeframe.
- max_date (string) – optional. End date limiting the search result timeframe.
- clientid_filter (None or list) – optional. List of clientids to query.
- sample_size (int) – optional. Percentage of the benchmarks the query will run over.
Returns:
columns (list of str): ['clientid', 'hour', 'api_calls']
results (list of tuple(str, int, int))
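As a rough illustration of working with the columns/results pair described above, the sketch below totals api_calls per client from hourly rows. The sample rows are invented, hour values are assumed to be integers as the tuple(str, int, int) signature suggests, and no dora call is made.

```python
from collections import defaultdict

# Invented sample shaped like the documented return value:
# columns ['clientid', 'hour', 'api_calls'], rows of (str, int, int).
columns = ["clientid", "hour", "api_calls"]
results = [
    ("client_a", 9, 12),
    ("client_a", 10, 30),
    ("client_b", 9, 5),
]


def total_calls_by_client(columns, results):
    """Sum api_calls across all hours for each clientid."""
    client_idx = columns.index("clientid")
    calls_idx = columns.index("api_calls")
    totals = defaultdict(int)
    for row in results:
        totals[row[client_idx]] += row[calls_idx]
    return dict(totals)


totals = total_calls_by_client(columns, results)
```

Looking up positions through the columns list, rather than hard-coding tuple indexes, keeps the helper working if the column order ever differs.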
-
insert(function, args, kwargs, start, end, is_cached, client_id)
Log an API call to SqlSource.
-
statsByFunction(min_date='1900-1-1', max_date=None, function_filter=None, sample_size=100)
Execution stats for API functions.
Stats are aggregated by function_name and is_cached.
Parameters:
- min_date (string) – optional. Start date limiting the search result timeframe.
- max_date (string) – optional. End date limiting the search result timeframe.
- function_filter (None or list) – optional. List of function names to query.
- sample_size (int) – optional. Percentage of the benchmarks the query will run over.
Returns:
columns (list of str): ['function_name', 'is_cached', 'avg_runtime_seconds', 'total_runtime_seconds', 'invocations']
results (list of tuple(str, bool, float, float, int))
-
dora.categories module
-
class dora.categories.Categories(asterix_config)
Bases: dora.datasources.AsterixSource
-
childrenOf(node_id)
Get the direct child categories of node_id.
Parameters: node_id (int)
Returns:
columns (list of str): ['nodeID', 'level', 'child_node_id']
results (list of tuple(int, int, int))
Return type: QueryResponse
-
parentOf(node_id)
Get the direct parent category of node_id.
Parameters: node_id (int)
Returns:
columns (list of str): ['nodeID', 'level', 'parent_node_id']
results (list of tuple(int, int, int))
Return type: QueryResponse
-
search(string_search='')
Case-sensitive search of all category levels.
Parameters: string_search (str)
Returns:
columns (list of str): ['classification', 'nodeid', 'level_0', 'level_1', 'level_2', 'level_3', 'level_4', 'level_5']
results (list of tuple(str, int, str, str, str, str, str, str))
Return type: QueryResponse
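The level_0 to level_5 columns returned by search can be joined into a readable category path. The sketch below is illustrative only: the sample row is invented, and it assumes that unused deeper levels come back as None or an empty string, which this page does not confirm.

```python
# Column layout as documented for search().
columns = ["classification", "nodeid", "level_0", "level_1",
           "level_2", "level_3", "level_4", "level_5"]

# Invented sample row; real values depend on the category data.
row = ("Books", 1000, "Books", "Subjects", "Science", None, None, None)


def category_path(columns, row, sep=" > "):
    """Join the non-empty level_0..level_5 values of one result row."""
    levels = [row[columns.index("level_%d" % i)] for i in range(6)]
    # Assumption: unused levels are None or '' and can simply be skipped.
    return sep.join(level for level in levels if level)


path = category_path(columns, row)
```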
-
dora.config module
-
class dora.config.AsterixConfig(d)
Bases: dora.config.ConfigType
-
cache_ttl
Cache time-to-live in seconds
-
collection
Collection to query
-
host
AsterixDB host API URL.
-
-
class dora.config.Config(path=None)
Bases: object
-
class dora.config.ConfigType(d)
Bases: object
Wrapper type for config property accessibility.
-
get_property(property_name)
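The "wrapper type" idea can be pictured as a thin class around a dictionary, with subclasses exposing selected keys as properties. The classes below are hypothetical stand-ins, not dora's implementation: the dictionary key names, the behaviour for a missing key, and the host value are all assumptions.

```python
class ConfigTypeSketch(object):
    """Hypothetical stand-in for a config wrapper built from a dict."""

    def __init__(self, d):
        self._d = dict(d)

    def get_property(self, property_name):
        # Assumption: a missing key yields None instead of raising.
        return self._d.get(property_name)


class AsterixConfigSketch(ConfigTypeSketch):
    """Exposes the three documented AsterixConfig properties."""

    @property
    def cache_ttl(self):
        return self.get_property("cache_ttl")

    @property
    def collection(self):
        return self.get_property("collection")

    @property
    def host(self):
        return self.get_property("host")


cfg = AsterixConfigSketch({
    "host": "http://localhost:19002",  # placeholder URL
    "collection": "categories",
    "cache_ttl": 60,
})
```

Property access (cfg.host) and get_property("host") read the same underlying value, so callers can use whichever style suits them.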
-
-
class dora.config.SolrConfig(d)
Bases: dora.config.ConfigType
-
cache_ttl
Cache time-to-live in seconds
-
host
Solr host API URL.
-
-
class dora.config.SqlConfig(d)
Bases: dora.config.ConfigType
-
cache_ttl
Cache time-to-live in seconds
-
connection_string
SQL connection string. Should be compatible with psycopg2.
-
random_seed
Integer seed for query sampling. Using the same seed results in repeatable queries.
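To see why a fixed seed makes sampled queries repeatable, consider the in-memory sketch below. It is only an analogy: dora presumably samples inside the database, and the function name and percentage arithmetic here are assumptions made for illustration.

```python
import random


def sample_rows(rows, sample_size, random_seed):
    """Return sample_size percent of rows, chosen repeatably for a given seed."""
    rng = random.Random(random_seed)       # private generator, seeded explicitly
    k = len(rows) * sample_size // 100     # number of rows to keep
    return rng.sample(rows, k)


rows = list(range(1000))
first = sample_rows(rows, 10, random_seed=42)
second = sample_rows(rows, 10, random_seed=42)
```

Because both calls seed their generator with the same value, first and second contain the same rows in the same order.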
-
dora.customers module
-
class dora.customers.Customers(sql_config)
Bases: dora.datasources.SqlSource
-
clusterCustomers(feature_set=None, n_clusters=8, algorithm='auto', init='k-means++', cluster_on=['numorders', 'gender', 'totalpop', 'totalspent', 'zipcode', 'medianage', 'totalmales', 'totalfemales'], scale=False)
Clusters the customers based on the cluster_on parameter.
Parameters:
- feature_set (QueryResponse or dictionary) – optional. Data that will be clustered. Must have keys 'results' and 'columns'.
- n_clusters (int) – optional. default=8. The number of clusters to form as well as the number of centroids to generate.
- algorithm (string) – optional. "auto", "full" or "elkan", default="auto". K-means algorithm to use. The classical EM-style algorithm is "full". The "elkan" variation is more efficient by using the triangle inequality, but currently doesn't support sparse data. "auto" chooses "elkan" for dense data and "full" for sparse data.
- init (string) – optional. {'k-means++', 'random' or an ndarray}. Method for initialization, defaults to 'k-means++'. 'k-means++': selects initial cluster centers for k-means clustering in a smart way to speed up convergence. See section Notes in k_init for more details. 'random': chooses k observations (rows) at random from the data for the initial centroids. If an ndarray is passed, it should be of shape (n_clusters, n_features) and gives the initial centers.
- cluster_on (list of str) – column names to use as cluster features
- scale (bool) – scale features
Returns:
columns (list of str): ['numorders', 'gender', 'totalpop', 'totalspent', 'zipcode', 'customermatchedid', 'medianage', 'totalmales', 'totalfemales', 'householdid', 'firstname', 'numcustomerid', 'cluster', 'customerids']
results (list of tuple(int, str, int, int, float, int, int, float, int, str, int, int, list(int)))
numOrders is the number of times a customer has purchased a book.
gender is the gender of the customer.
zipcode identifies the customer's location.
TotalPop is the total population for the zipcode.
MedianAge is the median age of the population for the zipcode.
customermatchedid is the number of customerids that matched with the customer.
TotalMales is the total number of males in the population for the zipcode.
TotalFemales is the total number of females in the population for the zipcode.
TotalSpent is the total amount the customer has spent on books.
householdid is the customer's household identification.
firstname is the customer's first name.
numCustomerid is the number of customerids per customer.
cluster is the label of the cluster the customer belongs to.
customerids is a list of all the customerids that correspond to that customer.
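The relationship between feature_set and cluster_on can be sketched as a column selection step: the named columns are pulled out of each result row to form the feature matrix that a k-means routine would then consume. The helper and sample data below are hypothetical and stop short of clustering; how dora encodes non-numeric columns such as gender is not described here.

```python
def build_feature_matrix(feature_set, cluster_on):
    """Select the cluster_on columns from a dict with 'columns' and 'results' keys."""
    columns = feature_set["columns"]
    indices = [columns.index(name) for name in cluster_on]
    return [[row[i] for i in indices] for row in feature_set["results"]]


# Invented sample with the two required keys.
feature_set = {
    "columns": ["numorders", "gender", "totalspent"],
    "results": [(3, "F", 42.5), (1, "M", 10.0)],
}

matrix = build_feature_matrix(feature_set, ["numorders", "totalspent"])
```

A name in cluster_on that is missing from feature_set["columns"] raises ValueError in this sketch, which surfaces typos in column names early.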
-
idsForCustomer(customermatchedids=[])
Find all customerids for each customermatchedid.
Parameters: customermatchedids (list) – optional. List of customermatchedids to filter on.
Returns:
columns (list of str): ['customermatchedid', 'customerid']
results (list of tuple(int, int))
Return type: QueryResponse
-
membersOfHousehold(householdID, sample_size=100)
Find the customerid, firstname, and gender of each member of a household.
Parameters:
- householdID (int) – The household whose members will be found.
- sample_size (int) – optional. Percentage of the data the query will run over.
Returns:
columns (list of str): ['customerid', 'firstname', 'gender']
results (list of tuple(int, str, str))
customerid is the unique customer id.
firstname is the first name of the customer.
gender is the gender of the customer.
-
productsByHousehold(householdID, min_date='1900-1-1', max_date=None, sample_size=100)
Find all of the products that a household has purchased.
Parameters:
- householdID (int) – The household whose purchased products will be found.
- min_date (string) – optional. Start date limiting the search result timeframe.
- max_date (string) – optional. End date limiting the search result timeframe.
- sample_size (int) – optional. Percentage of the data the query will run over.
-
statsByCustomer(min_date='1900-1-1', max_date=None, householdid=[], sample_size=100)
For each customer, find the number of book orders, gender, zipcode, household, first name, and total spend on books.
Parameters:
- min_date (string) – optional. Start date limiting the search result timeframe.
- max_date (string) – optional. End date limiting the search result timeframe.
- householdid (tuple) – optional. householdids that will be excluded from the query results.
- sample_size (int) – optional. Percentage of the data the query will run over.
Returns:
columns (list of str): ['numOrders', 'gender', 'zipcode', 'TotalPop', 'MedianAge', 'TotalMales', 'TotalFemales', 'TotalSpent', 'householdid', 'firstname', 'numCustomerid']
results (list of tuple(int, str, int, int, float, int, int, float, int, str, int))
numOrders is the number of times a customer has purchased a book.
gender is the gender of the customer.
zipcode identifies the customer's location.
TotalPop is the total population for the zipcode.
MedianAge is the median age of the population for the zipcode.
TotalMales is the total number of males in the population for the zipcode.
TotalFemales is the total number of females in the population for the zipcode.
TotalSpent is the total amount the customer has spent on books.
householdid is the customer's household identification.
firstname is the customer's first name.
numCustomerid is the number of customerids per customer.
-
statsByHousehold(min_date='1900-1-1', max_date=None, sample_size=100)
For each household, find the total amount spent, the total number of orders, the order date of the first and last order, the time spent as a customer, and the time since the last order.
Parameters:
- min_date (string) – optional. Start date limiting the search result timeframe.
- max_date (string) – optional. End date limiting the search result timeframe.
- sample_size (int) – optional. Percentage of the data the query will run over.
Returns:
columns (list of str): ['HouseholdID', 'TotalSpent', 'TotalOrders', 'first_order', 'last_order', 'time_as_customer', 'time_since_last_order']
results (list of tuple(int, float, int, date, date, interval, interval))
householdid is the unique household id.
TotalSpent is the amount of money spent on all orders.
TotalOrders is the number of orders that have been made by that household.
first_order is the time when the first order was made.
last_order is the time when the last order was made.
time_as_customer is the time the members of the household have been customers.
-
dora.datasources module
-
class dora.datasources.AsterixConnection(server)
Bases: object
-
query(statement, pretty=False, client_context_id=None)
-
-
class dora.datasources.AsterixQueryResponse(raw_response)
Bases: object
-
class dora.datasources.AsterixSource(asterix_config)
Bases: dora.datasources.Cacheable
-
class dora.datasources.Cacheable(ttl)
Bases: object
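Cacheable takes a ttl, and each config type exposes a cache_ttl in seconds. How dora implements its cache is not shown on this page; the class below is a hypothetical minimal time-to-live cache that only illustrates the general idea of results expiring after ttl seconds. The class name, its methods, and the injectable clock are all assumptions.

```python
import time


class TtlCacheSketch(object):
    """Hypothetical in-memory cache whose entries expire after ttl seconds."""

    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl
        self._clock = clock      # injectable so expiry can be tested without sleeping
        self._store = {}

    def put(self, key, value):
        self._store[key] = (value, self._clock())

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, stored_at = entry
        if self._clock() - stored_at >= self.ttl:
            del self._store[key]  # entry is stale: drop it and report a miss
            return None
        return value


# Usage with a fake clock that the caller advances by hand.
now = [0.0]
cache = TtlCacheSketch(ttl=10, clock=lambda: now[0])
cache.put("SELECT 1", [(1,)])
fresh = cache.get("SELECT 1")
now[0] = 11.0
stale = cache.get("SELECT 1")
```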
-
class dora.datasources.QueryResponse(columns, results)
Bases: object
Main response object for API queries
-
to_csv(path)
Convert response results to a CSV file with a header row, saved to path.
-
to_pandas()
Convert response results to a pandas DataFrame, using all default pandas settings.
-
vis
QueryVis object for visualization convenience
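The CSV layout described for to_csv (a header row followed by one line per result tuple) can be reproduced with the standard csv module. This is a sketch of the output format under that reading, not dora's code; the helper name and the sample rows are invented, and it writes to an in-memory buffer instead of a path.

```python
import csv
import io


def write_csv(columns, results, fileobj):
    """Write a header row followed by one row per result tuple."""
    writer = csv.writer(fileobj)
    writer.writerow(columns)
    writer.writerows(results)


buffer = io.StringIO(newline="")
write_csv(["asin", "numPurch"], [("B000123", 4), ("B000456", 1)], buffer)
csv_text = buffer.getvalue()
```

When writing to a real file, the csv module documentation recommends opening it with newline="" so that line endings are not translated twice.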
-
-
class dora.datasources.SolrSource(solr_config)
Bases: dora.datasources.Cacheable
-
class dora.datasources.SqlSource(sql_config)
Bases: dora.datasources.Cacheable
dora.logger module
-
dora.logger.log(f)
Decorator function that logs API function calls to a local log file and to SqlSource.
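A call-logging decorator of this kind typically wraps the function, times the call, and records what was called. The sketch below is hypothetical: it appends to an in-memory list instead of a log file or SqlSource, and the recorded fields merely echo some of the arguments of Benchmarks.insert (function, args, kwargs, start, end).

```python
import functools
import time

call_log = []  # stand-in sink; the real decorator is said to use a file and SqlSource


def log_calls(f):
    """Record name, arguments, and start/end time of every call to f."""
    @functools.wraps(f)
    def wrapper(*args, **kwargs):
        start = time.time()
        try:
            return f(*args, **kwargs)
        finally:
            # Runs whether f returned or raised, so failed calls are logged too.
            call_log.append({
                "function": f.__name__,
                "args": args,
                "kwargs": kwargs,
                "start": start,
                "end": time.time(),
            })
    return wrapper


@log_calls
def add(a, b=0):
    return a + b


result = add(2, b=3)
```

functools.wraps keeps the wrapped function's name and docstring, which matters when the recorded function name is later used for aggregation.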
dora.orders module
-
class dora.orders.Orders(sql_config)
Bases: dora.datasources.SqlSource
-
statsByProduct(min_date='1900-1-1', max_date=None, sample_size=100)
Produces statistics for each product.
Parameters:
- min_date (string) – optional. Start date limiting the search result timeframe.
- max_date (string) – optional. End date limiting the search result timeframe.
- sample_size (int) – optional. Percentage of the orders the query will run over.
Returns:
columns (list of str): ['productid', 'asin', 'num_orders', 'first_order', 'last_order', 'days_on_sale', 'unitprice_min', 'unitprice_max', 'uniteprice_avg', 'numunits_min', 'numunits_max', 'numunits_avg', 'numunites_sum', 'totalprice_min', 'totalprice_max', 'totalprice_avg', 'totalprice_sum']
results (list of tuple(str, int, int, float))
productid is the unique identifier for the product.
asin is the ASIN for the product.
first_order is the date the first order of the product was shipped.
last_order is the last day an order was shipped.
days_on_sale is the total number of days the product was on sale.
unitprice_min is the minimum price of the product.
unitprice_max is the maximum price of the product.
unitprice_avg is the average price of the product.
numunits_min is the minimum number of times the book was purchased in one order.
numunits_max is the largest number of times the book was purchased in one order.
numunits_avg is the average number of times the book was purchased in the same order.
numunits_sum is the number of times the book was purchased.
-
statsByZipcode(min_date='1900-1-1', max_date=None, sample_size=100)
For each zipcode, determine the number of orders and the total amount of money that has been spent.
Parameters:
- min_date (string) – optional. Start date limiting the search result timeframe.
- max_date (string) – optional. End date limiting the search result timeframe.
- sample_size (int) – optional. Percentage of the data the query will run over.
Returns:
columns (list of str): ['countyname', 'countypop', 'NumofOrders', 'TotalSpending']
results (list of tuple(str, int, int, float))
countyname is the name of the county the zipcode corresponds to.
countypop is the population of the county.
NumofOrders is the number of orders that have been placed by customers in the zipcode.
TotalSpending is the amount of money customers in the zipcode have spent.
-
dora.products module
-
class dora.products.Products(sql_config)
Bases: dora.datasources.SqlSource
-
byCategory(nodeid)
Retrieve products by category.
Parameters: nodeid (int) – category nodeid to search for
Returns:
columns (list of str): ['productid']
results (list of tuple(int))
Return type: QueryResponse
-
clusterProducts(feature_set=None, n_clusters=8, algorithm='auto', cluster_on=['numorders', 'avgrating', 'category', 'days_on_sale', 'spring_sales', 'summer_sales', 'fall_sales', 'winter_sales'], random_state=None, asin=None, scale=False, PCA=False, n_components=8)
Clusters the books using KMeans clustering, with the clusterQuery results as the features (num_orders, avgrating, category, and days_on_sale).
Parameters:
- n_clusters (int) – optional. default=8. The number of clusters to form as well as the number of centroids to generate.
- algorithm (string) – optional. "auto", "full" or "elkan", default="auto". K-means algorithm to use. The classical EM-style algorithm is "full". The "elkan" variation is more efficient by using the triangle inequality, but currently doesn't support sparse data. "auto" chooses "elkan" for dense data and "full" for sparse data.
- random_state (int) – optional. Integer used to seed the random number generator.
- asin (tuple(string)) – optional. ASINs that will be used as the centers of the k-means clustering.
Returns:
columns (list of str): ['productid', 'asin', 'y_pred']
results (list of tuple(int, str, int))
productid is the product's unique identifier.
asin is the identification of the book.
y_pred is the label of the cluster that the product belongs to.
-
coPurchases(asin, min_date='1900-1-1', max_date=None, sample_size=100)
For each given book, find all the books purchased in the same order as the given book and the number of times each of those books was purchased.
Parameters:
- asin (list) – required. Book ASIN ids. Determines the books that coPurchases will be searched for.
- min_date (string) – optional. Start date limiting the search result timeframe.
- max_date (string) – optional. End date limiting the search result timeframe.
- sample_size (int) – optional. Percentage of the data the query will run over.
Returns:
columns (list of str): ['asin', 'numPurch']
results (list of tuple(str, int))
asin is the identification of the book that was purchased in the same order as one of the input books.
numPurch is the number of times the book was purchased.
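The co-purchase idea can be sketched in memory with a counter over orders. This is an illustration under assumptions, not the query dora runs: orders are modelled as a hypothetical mapping from order id to the set of ASINs in that order, and numPurch is taken to mean the number of orders in which a book appears alongside one of the input books.

```python
from collections import Counter


def co_purchases(orders, asins):
    """Count, per book, the orders it shares with any of the given asins."""
    targets = set(asins)
    counts = Counter()
    for items in orders.values():
        if targets.intersection(items):       # order contains an input book
            for asin in items:
                if asin not in targets:       # skip the input books themselves
                    counts[asin] += 1
    return counts.most_common()


# Invented sample: order id -> set of ASINs in that order.
orders = {
    1: {"A", "B"},
    2: {"A", "B", "C"},
    3: {"C"},
}

pairs = co_purchases(orders, ["A"])
```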
-
priceDistribution(bins=5)
Produces statistics for each product.
Parameters: bins (int or list of tuple(int,int)) – If bins is an int, prices are binned with steps of (max price - min price)/bins. A list of tuples can be used to create your own bin limits.
Returns:
columns (list of str): ['count_<bin min>_to_<bin max>', ...]
results (list of tuple(int, int, ...))
Return type: QueryResponse
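The integer form of bins can be worked through by hand: with a step of (max price - min price)/bins, each bin covers one step-wide interval. The sketch below mirrors that arithmetic and the count_<bin min>_to_<bin max> naming. Whether dora treats bin edges as inclusive or exclusive is not stated, so the half-open choice here is an assumption, as are the helper names and the number formatting.

```python
def make_bins(prices, bins):
    """Split [min price, max price] into `bins` equal-width (min, max) limits."""
    lo, hi = min(prices), max(prices)
    step = (hi - lo) / bins
    return [(lo + i * step, lo + (i + 1) * step) for i in range(bins)]


def count_by_bin(prices, bin_limits):
    """Count prices per bin, keyed like the documented column names."""
    counts = {}
    last = len(bin_limits) - 1
    for i, (bin_min, bin_max) in enumerate(bin_limits):
        name = "count_%g_to_%g" % (bin_min, bin_max)
        # Assumption: bins are half-open [min, max); the last bin also includes max.
        counts[name] = sum(
            1 for p in prices
            if bin_min <= p < bin_max or (i == last and p == bin_max)
        )
    return counts


prices = [0, 5, 10, 15, 20]
distribution = count_by_bin(prices, make_bins(prices, 2))
```

A hand-written list of (min, max) tuples can be passed to count_by_bin directly, which corresponds to the second form of the bins parameter.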
-
ratingsDistribution(min_date='1900-1-1', max_date=None, asin=[], sample_size=100)
For each product, determine how many 1, 2, 3, 4, and 5 star reviews the product received.
Parameters:
- min_date (string) – optional. date. Inclusive lower limit of reviewTime.
- max_date (string) – optional. date. Inclusive upper limit of reviewTime.
- asin (list) – optional. The ASINs of the products the rating distribution will be produced for. Defaults to returning distributions for all ASINs.
- sample_size (int) – optional. Percentage of the reviews the query will run over.
Returns:
columns (list of str): ['asin', 'productid', 'one_star_votes', 'two_star_votes', 'three_star_votes', 'four_star_votes', 'five_star_votes']
results (list of tuple(str, int, int, int, int, int, int))
asin is the label for the book.
productid is the unique identifier for the product.
one_star_votes is the number of one star reviews the book received.
two_star_votes is the number of two star reviews the book received.
three_star_votes is the number of three star reviews the book received.
four_star_votes is the number of four star reviews the book received.
five_star_votes is the number of five star reviews the book received.
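One common use of a star-count row is to collapse it into an average rating. The helper below is a hypothetical post-processing step, not part of dora, and the sample row is invented; it weights each star level by its vote count.

```python
STAR_COLUMNS = ["one_star_votes", "two_star_votes", "three_star_votes",
                "four_star_votes", "five_star_votes"]


def average_rating(columns, row):
    """Weighted mean star rating for one result row, or None without votes."""
    votes = [row[columns.index(name)] for name in STAR_COLUMNS]
    total = sum(votes)
    if total == 0:
        return None
    weighted = sum(stars * count for stars, count in enumerate(votes, start=1))
    return weighted / total


columns = ["asin", "productid"] + STAR_COLUMNS
row = ("B000123", 17, 1, 0, 2, 3, 4)   # invented sample row
rating = average_rating(columns, row)
```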
-
seasonalOrderDistribution(asin=[])
Produces statistics for each product.
Parameters: asin (list of str) – ASIN product filter
Returns:
columns (list of str): ['asin', 'productid', 'spring_sales', 'summer_sales', 'fall_sales', 'winter_sales']
results (list of tuple(str, int, int, int, int, int))
Return type: QueryResponse
-
statsByProduct(asin=[], min_date='1900-1-1', max_date=None, sample_size=100)
For each book, returns the product id, ASIN, the number of times it was purchased, the average star rating, the product category, and the number of days the product has been on sale.
Parameters:
- min_date (string) – optional. Start date limiting the search result timeframe.
- max_date (string) – optional. End date limiting the search result timeframe.
- sample_size (int) – optional. Percentage of the data the query will run over.
Returns:
columns (list of str): ['productid', 'asin', 'num_orders', 'avgrating', 'category', 'days_on_sale']
results (list of tuple(int, str, int, float, int, int))
productid is the product's unique identifier.
asin is the identification of the book.
num_orders counts the number of times the book has been purchased.
avgrating is the average star rating of the book based on the user reviews.
category is the product category that the book belongs to.
days_on_sale is the number of days the book has been on sale.
-
dora.recommendations module
-
class dora.recommendations.Recommendations(sql_config)
Bases: dora.datasources.SqlSource
-
insert(customerid, productid, rank, timestamp)
-
statsByProduct(productids=[], min_date='1900-1-1', max_date=None, sample_size=100)
-
dora.reviews module
-
class dora.reviews.Reviews(sql_config, solr_config)
Bases: dora.datasources.SqlSource, dora.datasources.SolrSource
-
asinByTerms(terms)
Given a list of words, find the ASIN with the highest Solr score from its reviews.
The Solr score is configurable through the facet.method parameter. The default 'fc' option is used. https://lucene.apache.org/solr/guide/6_6/faceting.html#Faceting-Thefacet.methodParameter
Parameters: terms (list) – list of strings
Returns: asin is a product identifier. score is the metric Solr is configured for (see above).
Return type: tuple
-
termsByAsin(asin)
Given an ASIN, find the terms with the highest Solr score from its reviews.
The Solr score is configurable through the facet.method parameter. The default 'fc' option is used. https://lucene.apache.org/solr/guide/6_6/faceting.html#Faceting-Thefacet.methodParameter
Parameters: asin (string) – required. Product identifier. Terms are analyzed for this product's reviews.
Returns: term is a token as defined by Solr. score is the metric Solr is configured for (see above).
Return type: tuple(term, score)
-