Βάσεις Δεδομένων 2001-2002, Ευαγγελία Πιτουρά: Data Mining

Βάσεις Δεδομένων Ευαγγελία Πιτουρά 1 Data Mining

Βάσεις Δεδομένων Ευαγγελία Πιτουρά 2 Introduction
Finding interesting trends or patterns in large datasets.
- Statistics: exploratory data analysis
- Artificial intelligence: knowledge discovery and machine learning
Scalability with respect to data size: an algorithm is scalable if its running time grows (linearly) in proportion to the dataset size, given the available system resources.

Βάσεις Δεδομένων Ευαγγελία Πιτουρά 3 Introduction
Finding interesting trends or patterns in large datasets:
- SQL queries (based on the relational algebra)
- OLAP queries (higher-level query constructs, multidimensional data model)
- Data mining techniques

Βάσεις Δεδομένων Ευαγγελία Πιτουρά 4 Knowledge Discovery (KDD)
The Knowledge Discovery Process:
- Data Selection: identify the target dataset and the relevant attributes.
- Data Cleaning: remove noise and outliers, transform field values to common units, generate new fields, bring the data into the relational schema.
- Data Mining.
- Evaluation: present the patterns in an understandable form to the end user (e.g., through visualization).

Βάσεις Δεδομένων Ευαγγελία Πιτουρά 5 Overview
- Counting Co-occurrences: Frequent Itemsets, Iceberg Queries
- Mining for Rules: Association Rules, Sequential Patterns
- Classification and Regression
- Tree-Structured Rules
- Clustering
- Similarity Search over Sequences

Βάσεις Δεδομένων Ευαγγελία Πιτουρά 6 Counting Co-Occurrences
A market basket is a collection of items purchased by a customer in a single customer transaction.
Goal: identify items that are purchased together.

Βάσεις Δεδομένων Ευαγγελία Πιτουρά 7 Example

    transid  custid  date     item   qty
    111      201     5/1/99   pen    2
    111      201     5/1/99   ink    1
    111      201     5/1/99   milk   3
    111      201     5/1/99   juice  6
    112      105     6/3/99   pen    1
    112      105     6/3/99   ink    1
    112      105     6/3/99   milk   1
    113      106     5/10/99  pen    1
    113      106     5/10/99  milk   1
    114      201     6/1/99   pen    2
    114      201     6/1/99   ink    2
    114      201     6/1/99   juice  4

Rows that share a transid form one transaction; note that there is redundancy (custid and date are repeated for every item of a transaction).
Observations of the form: in 75% of the transactions, both pen and ink are purchased together.

Βάσεις Δεδομένων Ευαγγελία Πιτουρά 8 Frequent Itemsets
Itemset: a set of items.
Support of an itemset: the fraction of transactions in the database that contain all items in the itemset.
Example: itemset {pen, ink} has support 75%; itemset {milk, juice} has support 25%.
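A quick illustration (ours, not on the slides): with each transaction modeled as a Python set, support is one pass over the transactions. The function name and data layout are our assumptions; the transaction contents come from the example table.

    # Minimal sketch: support of an itemset over transactions modeled as sets.
    def support(itemset, transactions):
        """Fraction of transactions that contain every item in the itemset."""
        return sum(1 for t in transactions if itemset <= t) / len(transactions)

    # The four transactions of the example Purchases table:
    transactions = [
        {"pen", "ink", "milk", "juice"},   # transid 111
        {"pen", "ink", "milk"},            # transid 112
        {"pen", "milk"},                   # transid 113
        {"pen", "ink", "juice"},           # transid 114
    ]

    print(support({"pen", "ink"}, transactions))     # 0.75
    print(support({"milk", "juice"}, transactions))  # 0.25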

Βάσεις Δεδομένων Ευαγγελία Πιτουρά 9 Frequent Itemsets
Frequent itemsets: itemsets whose support is higher than a user-specified minimum support, called minsup.
Example: if minsup = 70%, the frequent itemsets are {pen}, {ink}, {milk}, {pen, ink}, {pen, milk}.

Βάσεις Δεδομένων Ευαγγελία Πιτουρά 10 Frequent Itemsets
An algorithm for identifying (all) frequent itemsets?
The a priori property: every subset of a frequent itemset must also be a frequent itemset.

Βάσεις Δεδομένων Ευαγγελία Πιτουρά 11 Frequent Itemsets
An algorithm for identifying frequent itemsets:

    for each item, check if it is a frequent itemset
    k = 1
    repeat
        for each new frequent itemset I_k with k items:
            generate all itemsets I_{k+1} with k+1 items such that I_k ⊂ I_{k+1}
        scan all transactions once and check if the generated (k+1)-itemsets are frequent
        k = k + 1
    until no new frequent itemsets are identified

Βάσεις Δεδομένων Ευαγγελία Πιτουρά 12 Frequent Itemsets
A refinement of the algorithm for identifying frequent itemsets:

    for each item, check if it is a frequent itemset
    k = 1
    repeat
        for each new frequent itemset I_k with k items:
            generate all itemsets I_{k+1} with k+1 items, I_k ⊂ I_{k+1},
            all of whose k-item subsets are frequent itemsets
        scan all transactions once and check if the generated (k+1)-itemsets are frequent
        k = k + 1
    until no new frequent itemsets are identified
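The refined algorithm fits in a few lines of Python; the sketch below (our own illustration, not the slides' code) joins frequent k-itemsets, prunes candidates that have a non-frequent k-subset, and counts the survivors in one scan per level.

    from itertools import combinations

    def apriori(transactions, minsup):
        """Level-wise frequent itemset mining, following the refined schema."""
        n = len(transactions)
        items = {i for t in transactions for i in t}
        # k = 1: check every single item
        freq = {frozenset([i]) for i in items
                if sum(1 for t in transactions if i in t) / n >= minsup}
        result = set(freq)
        k = 1
        while freq:
            # generate (k+1)-itemsets by joining frequent k-itemsets ...
            candidates = {a | b for a in freq for b in freq if len(a | b) == k + 1}
            # ... keeping only those all of whose k-subsets are frequent
            candidates = {c for c in candidates
                          if all(frozenset(s) in freq for s in combinations(c, k))}
            # one scan over the transactions to count the surviving candidates
            freq = {c for c in candidates
                    if sum(1 for t in transactions if c <= t) / n >= minsup}
            result |= freq
            k += 1
        return result

    transactions = [{"pen", "ink", "milk", "juice"}, {"pen", "ink", "milk"},
                    {"pen", "milk"}, {"pen", "ink", "juice"}]
    print(apriori(transactions, minsup=0.7))
    # {pen}, {ink}, {milk}, {pen, ink}, {pen, milk} (as frozensets)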

Βάσεις Δεδομένων Ευαγγελία Πιτουρά 13 Iceberg Queries
Assume we want to find pairs of customers and items such that the customer has purchased the item at least 5 times:

    select P.custid, P.item, sum(P.qty)
    from Purchases P
    group by P.custid, P.item
    having sum(P.qty) > 5

Execution plan for the query? The number of groups is very large, but the answer to the query (the tip of the iceberg) is usually very small.

Βάσεις Δεδομένων Ευαγγελία Πιτουρά 14 Iceberg Queries
An iceberg query has the general form:

    select R.A1, R.A2, …, R.Ak, agg(R.B)
    from Relation R
    group by R.A1, R.A2, …, R.Ak
    having agg(R.B) >= constant

Is there an a priori property, similar to the a priori property for frequent itemsets?

Βάσεις Δεδομένων Ευαγγελία Πιτουρά 15 Iceberg Queries

    select P.custid, P.item, sum(P.qty)
    from Purchases P
    group by P.custid, P.item
    having sum(P.qty) > 5

    Q1: select P.custid
        from Purchases P
        group by P.custid
        having sum(P.qty) > 5

    Q2: select P.item
        from Purchases P
        group by P.item
        having sum(P.qty) > 5

Generate (custid, item) pairs only for custids from Q1 and items from Q2.
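The same rewriting expressed procedurally, as a sketch (ours; rows are assumed to be (custid, item, qty) tuples): only pairs whose custid passes Q1 and whose item passes Q2 are ever materialized and counted.

    from collections import defaultdict

    def iceberg_pairs(rows, threshold=5):
        """Count (custid, item) pairs above the threshold, pruning a priori."""
        by_cust, by_item = defaultdict(int), defaultdict(int)
        for custid, item, qty in rows:          # first pass: compute Q1 and Q2
            by_cust[custid] += qty
            by_item[item] += qty
        q1 = {c for c, s in by_cust.items() if s > threshold}
        q2 = {i for i, s in by_item.items() if s > threshold}
        pairs = defaultdict(int)
        for custid, item, qty in rows:          # second pass: surviving pairs only
            if custid in q1 and item in q2:
                pairs[(custid, item)] += qty
        return {p: s for p, s in pairs.items() if s > threshold}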

Βάσεις Δεδομένων Ευαγγελία Πιτουρά 16 Overview
- Counting Co-occurrences: Frequent Itemsets, Iceberg Queries
- Mining for Rules: Association Rules, Sequential Patterns
- Classification and Regression
- Tree-Structured Rules
- Clustering
- Similarity Search over Sequences

Βάσεις Δεδομένων Ευαγγελία Πιτουρά 17 Association Rules
Example: {pen} ⇒ {ink}
If a pen is purchased in a transaction, it is likely that ink will also be purchased in that transaction.
In general: LHS ⇒ RHS

Βάσεις Δεδομένων Ευαγγελία Πιτουρά 18 Association Rules
For a rule LHS ⇒ RHS:
Support: support(LHS ∪ RHS), the percentage of transactions that contain all of these items.
Confidence: support(LHS ∪ RHS) / support(LHS), an indication of the strength of the rule; an estimate of P(RHS | LHS).
An algorithm for finding all rules with minsup and minconf?

Βάσεις Δεδομένων Ευαγγελία Πιτουρά 19 Association Rules
An algorithm for finding all rules with minsup and minconf:
Step 1: find all frequent itemsets with minsup.
Step 2: generate all rules from the itemsets of step 1:

    for each frequent itemset I with support support(I)
        divide I into LHS_I and RHS_I
        confidence = support(I) / support(LHS_I)
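Step 2 as a Python sketch (our illustration; frequent is assumed to be a collection of frozensets produced by step 1, and support is the function from the earlier sketch, repeated here to keep the snippet self-contained):

    from itertools import chain, combinations

    def support(itemset, transactions):
        return sum(1 for t in transactions if itemset <= t) / len(transactions)

    def rules_from_itemsets(frequent, transactions, minconf):
        """Split each frequent itemset I into LHS_I and RHS_I; keep the
        rules whose confidence reaches minconf."""
        rules = []
        for I in frequent:
            # every non-empty proper subset of I can serve as LHS_I
            lhss = chain.from_iterable(combinations(I, r) for r in range(1, len(I)))
            for lhs in map(frozenset, lhss):
                conf = support(I, transactions) / support(lhs, transactions)
                if conf >= minconf:
                    rules.append((set(lhs), set(I - lhs), conf))
        return rules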

Βάσεις Δεδομένων Ευαγγελία Πιτουρά 20 Association Rules and ISA Hierarchies
An ISA (category) hierarchy on the set of items: a transaction implicitly contains, for each of its items, all of the item's ancestors.

    Beverage: Milk, Juice
    Stationery: Pen, Ink

- Detect relationships between items at different levels of the hierarchy.
- In general, the support of an itemset can only increase if an item is replaced by its ancestor.

Βάσεις Δεδομένων Ευαγγελία Πιτουρά 21 Generalized Association Rules
More general: not just customer transactions. For example, group the tuples of the Purchases table (slide 7) by custid.
Rule {pen} ⇒ {milk}: if a pen is purchased by a customer, it is likely that milk will also be purchased by the customer.

Βάσεις Δεδομένων Ευαγγελία Πιτουρά 22 Generalized Association Rules
Group tuples by date: calendric market basket analysis.
A calendar is any group of dates, e.g., every first of the month.
Given a calendar, compute association rules over the set of tuples whose date field falls within the calendar (Purchases table of slide 7).
Calendar: every first of the month.
Rule {pen} ⇒ {juice}: support 100% within the calendar; over the entire table: 50%.
Rule {pen} ⇒ {milk}: support 50% within the calendar; over the entire table: 75%.

Βάσεις Δεδομένων Ευαγγελία Πιτουρά 23 Sequential Patterns
Sequence of itemsets: the sequence of itemsets purchased by a customer.
Example, custid 201: ⟨{pen, ink, milk, juice}, {pen, ink, juice}⟩ (ordered by date).
A subsequence of a sequence of itemsets is obtained by deleting one or more itemsets, and is also a sequence of itemsets.

Βάσεις Δεδομένων Ευαγγελία Πιτουρά 24 Sequential Patterns
A sequence ⟨a_1, a_2, …, a_m⟩ is contained in a sequence S if S has a subsequence ⟨b_1, …, b_m⟩ such that a_i ⊆ b_i for 1 ≤ i ≤ m.
Example: ⟨{pen}, {ink, milk}, {pen, juice}⟩ is contained in ⟨{pen, ink}, {shirt}, {juice, ink, milk}, {juice, pen, milk}⟩.
The order of items within each itemset does not matter, but the order of itemsets does matter:
⟨{pen}, {ink, milk}, {pen, juice}⟩ is not contained in ⟨{pen, ink}, {shirt}, {juice, pen, milk}, {juice, milk, ink}⟩.

Βάσεις Δεδομένων Ευαγγελία Πιτουρά 25 Sequential Patterns
The support for a sequence S of itemsets is the percentage of customer sequences of which S is a subsequence.
The problem: identify all sequences that have a minimum support.
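Containment of one sequence of itemsets in another can be checked greedily, matching each itemset against the earliest later itemset that contains it. A small sketch (ours) using the example of the previous slide:

    def contained(A, B):
        """Is the sequence of itemsets A contained in the sequence B?"""
        j = 0
        for a in A:
            while j < len(B) and not a <= B[j]:   # find a b_j that contains a
                j += 1
            if j == len(B):
                return False
            j += 1                                # later itemsets must come after
        return True

    A = [{"pen"}, {"ink", "milk"}, {"pen", "juice"}]
    B1 = [{"pen", "ink"}, {"shirt"}, {"juice", "ink", "milk"}, {"juice", "pen", "milk"}]
    B2 = [{"pen", "ink"}, {"shirt"}, {"juice", "pen", "milk"}, {"juice", "milk", "ink"}]
    print(contained(A, B1))  # True
    print(contained(A, B2))  # False: the order of itemsets matters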

Βάσεις Δεδομένων Ευαγγελία Πιτουρά 26 Overview
- Counting Co-occurrences: Frequent Itemsets, Iceberg Queries
- Mining for Rules: Association Rules, Sequential Patterns
- Classification and Regression
- Tree-Structured Rules
- Clustering
- Similarity Search over Sequences

Βάσεις Δεδομένων Ευαγγελία Πιτουρά 27 Classification and Regression Rules
InsuranceInfo(age: integer, cartype: string, highrisk: boolean)
There is one attribute (highrisk) whose value we would like to predict: the dependent attribute.
The other attributes are called the predictors.
General form of the rules we want to discover:

    P_1(X_1) ∧ P_2(X_2) ∧ … ∧ P_k(X_k) ⇒ Y = c

Βάσεις Δεδομένων Ευαγγελία Πιτουρά 28 Classification and Regression Rules

    P_1(X_1) ∧ P_2(X_2) ∧ … ∧ P_k(X_k) ⇒ Y = c

The P_i(X_i) are predicates, of two types:
- numerical: P_i(X_i) : l_i ≤ X_i ≤ h_i
- categorical: P_i(X_i) : X_i ∈ {v_1, …, v_j}
Numerical dependent attribute: regression rule.
Categorical dependent attribute: classification rule.
Example: (16 ≤ age ≤ 25) ∧ (cartype ∈ {Sports, Truck}) ⇒ highrisk = true

Βάσεις Δεδομένων Ευαγγελία Πιτουρά 29 Classification and Regression Rules

    P_1(X_1) ∧ P_2(X_2) ∧ … ∧ P_k(X_k) ⇒ Y = c

Support: the support for a condition C is the percentage of tuples that satisfy C. The support for a rule C1 ⇒ C2 is the support of the condition C1 ∧ C2.
Confidence: consider the tuples that satisfy condition C1. The confidence for a rule C1 ⇒ C2 is the percentage of such tuples that also satisfy condition C2.

Βάσεις Δεδομένων Ευαγγελία Πιτουρά 30 Classification and Regression Rules
Classification and regression rules differ from association rules in that they consider continuous and categorical attributes, rather than a single set-valued field.

Βάσεις Δεδομένων Ευαγγελία Πιτουρά 31 Overview
- Counting Co-occurrences: Frequent Itemsets, Iceberg Queries
- Mining for Rules: Association Rules, Sequential Patterns
- Classification and Regression
- Tree-Structured Rules
- Clustering
- Similarity Search over Sequences

Βάσεις Δεδομένων Ευαγγελία Πιτουρά 32 Tree-Structured Rules
Classification (decision) trees and regression trees.
Typically the tree itself is the output of data mining:
- easy to understand
- efficient algorithms exist to construct them

Βάσεις Δεδομένων Ευαγγελία Πιτουρά 33 Decision Trees
A graphical representation of a collection of classification rules. Given a data record, the tree directs the record from the root to a leaf.
- Internal nodes: labeled with a predictor attribute (called a splitting attribute).
- Outgoing edges: labeled with predicates that involve the splitting attribute of the node (the splitting criterion).
- Leaf nodes: labeled with a value of the dependent attribute.

Βάσεις Δεδομένων Ευαγγελία Πιτουρά 34 Decision Trees
Example:

    Age
      > 25  → No
      <= 25 → Car Type
                Sports, Truck → Yes
                Other         → No

Construct classification rules from the paths from the root to a leaf: the LHS is the conjunction of the predicates along the path; the RHS is the value at the leaf.
(16 ≤ age ≤ 25) ∧ (cartype ∈ {Sports, Truck}) ⇒ highrisk = true

Βάσεις Δεδομένων Ευαγγελία Πιτουρά 35 Decision Trees
Constructed in two phases.
Phase 1 (growth phase): construct a very large tree (e.g., with leaf nodes for individual records in the database). The tree is built greedily top down: at the root node, examine the database and compute the locally best splitting criterion; partition the database into two parts; recurse on each child.
Phase 2 (pruning phase).

Βάσεις Δεδομένων Ευαγγελία Πιτουρά 36 Decision Trees
Top-down decision tree induction schema.
Input: node n, partition D, split selection method S
Output: decision tree for D rooted at node n

    BuildTree(node n, partition D, method S)
        apply S to D to find the splitting criterion
        if (a good splitting criterion is found)
            create two children nodes n1 and n2 of n
            partition D into D1 and D2
            BuildTree(n1, D1, S)
            BuildTree(n2, D2, S)
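The schema translates almost line for line into Python. In this sketch (ours), S is assumed to return a boolean splitting predicate, or None when no good split exists; records are dicts, and leaves are labeled with the majority class. S must eventually return None (e.g., on pure partitions) for the recursion to terminate.

    from collections import Counter

    def majority_label(D, target="highrisk"):
        """Label a leaf with the most common value of the dependent attribute."""
        return Counter(r[target] for r in D).most_common(1)[0][0]

    def build_tree(D, S):
        crit = S(D)                            # apply S to D
        if crit is None:                       # no good splitting criterion: leaf
            return {"leaf": majority_label(D)}
        D1 = [r for r in D if crit(r)]         # partition D into D1 and D2
        D2 = [r for r in D if not crit(r)]
        return {"split": crit,
                "yes": build_tree(D1, S),
                "no": build_tree(D2, S)}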

Βάσεις Δεδομένων Ευαγγελία Πιτουρά 37 Decision Trees
Split selection method: an algorithm that takes as input (part of) a relation and outputs the locally best splitting criterion.
Example: examine the attributes cartype and age, select one of them as the splitting attribute, and then select the splitting predicates.

Βάσεις Δεδομένων Ευαγγελία Πιτουρά 38 Decision Trees
How can we construct decision trees when the input database is larger than main memory?
Provide the split selection method with aggregated information about the database, instead of loading the complete database into main memory.
We need aggregated information for each predictor attribute: the AVC set of a predictor attribute X at node n is the projection of n's database partition onto X and the dependent attribute, where counts of the individual values in the domain of the dependent attribute are aggregated.

Βάσεις Δεδομένων Ευαγγελία Πιτουρά 39 Decision Trees

    age  cartype  highrisk
    23   Sedan    false
    30   Sports   false
    36   Sedan    false
    25   Truck    true
    30   Sedan    false
    23   Truck    true
    30   Truck    false
    25   Sports   true
    18   Sedan    false

AVC set of the predictor attribute age at the root node:

    select R.age, R.highrisk, count(*)
    from InsuranceInfo R
    group by R.age, R.highrisk

AVC set of the predictor attribute cartype at the left child of the root node:

    select R.cartype, R.highrisk, count(*)
    from InsuranceInfo R
    where R.age <= 25
    group by R.cartype, R.highrisk

Βάσεις Δεδομένων Ευαγγελία Πιτουρά 40 Decision Trees
(InsuranceInfo table and the age AVC query as on slide 39.)
The AVC set of the predictor attribute cartype at the root node, as a table of counts:

             true  false
    Sedan      0     4
    Sports     1     1
    Truck      2     1
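Computing an AVC set is a single counting pass, mirroring the GROUP BY query; a sketch (ours) over the table above:

    from collections import Counter

    records = [
        (23, "Sedan", False), (30, "Sports", False), (36, "Sedan", False),
        (25, "Truck", True),  (30, "Sedan", False),  (23, "Truck", True),
        (30, "Truck", False), (25, "Sports", True),  (18, "Sedan", False),
    ]

    # AVC set of cartype at the root: counts of (cartype, highrisk) pairs
    avc_cartype = Counter((cartype, highrisk) for _, cartype, highrisk in records)
    print(avc_cartype[("Truck", True)])    # 2
    print(avc_cartype[("Sedan", False)])   # 4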

Βάσεις Δεδομένων Ευαγγελία Πιτουρά 41 Decision Trees
Size of the AVC set? It is bounded by the number of distinct values of the predictor attribute times the number of values of the dependent attribute, not by the number of records.
AVC group of a node n: the set of the AVC sets of all predictor attributes at node n.

Βάσεις Δεδομένων Ευαγγελία Πιτουρά 42 Decision Trees
Top-down decision tree induction schema.
Input: node n, partition D, split selection method S
Output: decision tree for D rooted at node n

    BuildTree(node n, partition D, method S)
        make a scan over D and construct the AVC group of node n in memory
        apply S to the AVC group to find the splitting criterion
        if (a good splitting criterion is found)
            create two children nodes n1 and n2 of n
            partition D into D1 and D2
            BuildTree(n1, D1, S)
            BuildTree(n2, D2, S)

Βάσεις Δεδομένων Ευαγγελία Πιτουρά 43 Overview
- Counting Co-occurrences: Frequent Itemsets, Iceberg Queries
- Mining for Rules: Association Rules, Sequential Patterns
- Classification and Regression
- Tree-Structured Rules
- Clustering
- Similarity Search over Sequences

Βάσεις Δεδομένων Ευαγγελία Πιτουρά 44 Clustering
Partition a set of records into groups (clusters) such that all records within a group are similar to each other, and records that belong to two different groups are dissimilar.
Similarity between records is measured computationally by a distance function.

Βάσεις Δεδομένων Ευαγγελία Πιτουρά 45 Clustering
CustomerInfo(age: integer, salary: real)
(Figure: scatter plot of salary against age.) Visually we can identify three clusters; the shape of the clusters is spherical.

Βάσεις Δεδομένων Ευαγγελία Πιτουρά 46 Clustering
The output of a clustering algorithm consists of a summarized representation of each cluster. The type of output depends on the type and shape of the clusters. For example, for spherical clusters: the center C (mean) and the radius R. Given a collection of records r_1, r_2, …, r_n:

    C = (Σ_i r_i) / n        R = sqrt( Σ_i (r_i - C)² / n )

Βάσεις Δεδομένων Ευαγγελία Πιτουρά 47 Clustering
Two types of clustering algorithms:
- Partitional clustering: partitions the data into k groups such that some criterion that evaluates the clustering quality is optimized.
- Hierarchical clustering: generates a sequence of partitions of the records, starting with a partition in which each cluster consists of a single record and merging two clusters in each step.

Βάσεις Δεδομένων Ευαγγελία Πιτουρά 48 Clustering
Assumptions: a large number of records, and just one scan over them; a limited amount of main memory.
The BIRCH algorithm has two parameters:
- k: main memory threshold; the maximum number of cluster summaries that can be maintained in memory.
- e: initial threshold for the radius of each cluster; a cluster is compact if its radius is smaller than e.
BIRCH always maintains in main memory k or fewer compact cluster summaries (C_i, R_i). (If this is not possible, adjust e.)

Βάσεις Δεδομένων Ευαγγελία Πιτουρά 49 Clustering
The BIRCH algorithm:

    read a record r from the database
    compute the distance between r and each of the existing cluster centers
    let i be the cluster (index) such that the distance between r and C_i is smallest
    compute R'_i assuming r is inserted into the i-th cluster
    if R'_i ≤ e, insert r into the i-th cluster and recompute R_i and C_i
    else start a new cluster containing only r
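A sketch of one BIRCH-style pass over one-dimensional records (our simplification): each summary stores (n, Σr, Σr²), from which C = Σr/n and R = sqrt(Σr²/n - C²) are cheap to recompute; the memory threshold k and the adjustment of e when it is exceeded are omitted.

    import math

    def birch_pass(records, e):
        """Absorb each record into its nearest cluster if the cluster stays
        compact (radius <= e); otherwise start a new cluster."""
        clusters = []                                   # [n, s, s2] per cluster
        for r in records:
            best = min(clusters, key=lambda c: abs(r - c[1] / c[0]), default=None)
            if best is not None:
                n, s, s2 = best[0] + 1, best[1] + r, best[2] + r * r
                center = s / n
                radius = math.sqrt(max(s2 / n - center * center, 0.0))
                if radius <= e:                         # still compact: absorb r
                    best[:] = [n, s, s2]
                    continue
            clusters.append([1, r, r * r])              # new cluster containing only r
        return [(c[1] / c[0],                                            # C_i
                 math.sqrt(max(c[2] / c[0] - (c[1] / c[0]) ** 2, 0.0)))  # R_i
                for c in clusters]

    print(birch_pass([1.0, 1.2, 0.9, 10.0, 10.3], e=0.5))
    # two summaries: one with center near 1.03, one with center 10.15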

Βάσεις Δεδομένων Ευαγγελία Πιτουρά 50 Overview
- Counting Co-occurrences: Frequent Itemsets, Iceberg Queries
- Mining for Rules: Association Rules, Sequential Patterns
- Classification and Regression
- Tree-Structured Rules
- Clustering
- Similarity Search over Sequences

Βάσεις Δεδομένων Ευαγγελία Πιτουρά 51 Similarity Search over Sequences
A user specifies a query sequence and wants to retrieve all data sequences that are similar to the query sequence: not exact matches.
A data sequence X is a sequence of numbers X = ⟨x_1, x_2, …, x_k⟩, also called a time series; k is the length of the sequence.
A subsequence Z = ⟨z_1, z_2, …, z_j⟩ is obtained from another sequence by deleting numbers from the front and back of the sequence.

Βάσεις Δεδομένων Ευαγγελία Πιτουρά 52 Similarity Search over Sequences
Given two sequences X and Y, we can define the distance between them as the Euclidean norm ‖X - Y‖ = sqrt(Σ_i (x_i - y_i)²).
Given a user-specified query sequence and a threshold parameter e, retrieve all data sequences within distance e of the query sequence.
- Complete sequence matching: the query sequence and the sequences in the database have the same length.
- Subsequence matching: the query is shorter.

Βάσεις Δεδομένων Ευαγγελία Πιτουρά 53 Similarity Search over Sequences
Given a user-specified query sequence and a threshold parameter e, retrieve all data sequences within distance e of the query sequence.
Brute-force method? Compare the query sequence against every data sequence.
Each data sequence and the query sequence of length k may be represented as a point in a k-dimensional space, so we can construct a multidimensional index over the data sequences.
Non-exact matches? Query the index with a hyper-rectangle with side length 2·e centered at the query sequence.
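For complete sequence matching, the brute-force method is a linear scan under the Euclidean norm; a minimal sketch (ours) of what a multidimensional index would accelerate:

    import math

    def euclidean(x, y):
        """Euclidean norm of x - y for equal-length sequences."""
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

    def range_search(query, data, e):
        """Return every data sequence within distance e of the query."""
        return [x for x in data if euclidean(x, query) <= e]

    data = [(1.0, 2.0, 3.0), (1.1, 2.1, 2.9), (5.0, 5.0, 5.0)]
    print(range_search((1.0, 2.0, 3.0), data, e=0.5))   # the first two sequences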