Data Mining and Large-Scale Data Management
Unit 7: Topics in Data Mining and Business Intelligence. Case Study
Ch. Skourlas (Χ. Σκουρλάς), Department of Informatics Engineering (Τμήμα Μηχανικών Πληροφορικής Τ.Ε.)
Open Academic Courses at TEI of Athens. The course content is available under a Creative Commons licence unless stated otherwise. The project is implemented within the framework of the Operational Programme "Education and Lifelong Learning" and is co-financed by the European Union (European Social Fund) and by national funds.

Description
This unit covers important data mining concepts that relate to business intelligence. The case study is carried out using RapidMiner.

CRISP-DM, the CRoss-Industry Standard Process for Data Mining.

CRISP-DM Step 1: Business (Organizational) Understanding
How can we increase the profit margin per unit of product? How can we predict and fix manufacturing flaws so that we avoid shipping a defective product? From there, you can begin to develop the more specific questions you want to answer, and this will enable you to proceed to...

CRISP-DM Step 2: Data Understanding
Where do the data come from? Who collected them? Was a standard method of collection used? What do the various columns and rows of the data mean? Are there acronyms or abbreviations that are unknown or unclear?

CRISP-DM Step 3: Data Preparation (Data Mining for the Masses)
Data preparation involves a number of activities. It may join two or more data sets, reduce data sets to only those variables that are of interest in a given data mining exercise, scrub the data clean of outliers, fill in or otherwise handle missing data, reformat data for consistency, and so on.
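The activities listed above can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the two tiny data sets, the attribute names (`customer_id`, `age`, `city`, `spend`) and the choice to fill a missing value with the attribute mean are hypothetical and do not come from the case study.

```python
from statistics import mean

# Two hypothetical data sets that share the key "customer_id".
customers = [
    {"customer_id": 1, "age": 34, "city": "Athens"},
    {"customer_id": 2, "age": None, "city": "Patras"},  # missing age
    {"customer_id": 3, "age": 51, "city": "Athens"},
]
purchases = [
    {"customer_id": 1, "spend": 120.0},
    {"customer_id": 2, "spend": 80.0},
    {"customer_id": 3, "spend": 95.0},
]

def join_on_key(left, right, key):
    """Inner join of two lists of records on a shared key."""
    right_by_key = {row[key]: row for row in right}
    return [{**row, **right_by_key[row[key]]}
            for row in left if row[key] in right_by_key]

def keep_attributes(rows, names):
    """Reduce every record to the attributes of interest."""
    return [{name: row[name] for name in names} for row in rows]

def fill_missing_with_mean(rows, name):
    """Replace missing (None) values of one attribute with its mean."""
    fill = mean(row[name] for row in rows if row[name] is not None)
    return [{**row, name: fill if row[name] is None else row[name]}
            for row in rows]

joined = join_on_key(customers, purchases, "customer_id")
reduced = keep_attributes(joined, ["customer_id", "age", "spend"])
prepared = fill_missing_with_mean(reduced, "age")
print(prepared)
```

Filling with the mean is only one of several possible strategies; dropping the record or using a default value would be equally valid depending on the business question.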

CRISP-DM Step 4: Modeling (Data Mining for the Masses)
Put simply, a model in data mining is a computerized representation of real-world observations. Models result from applying algorithms that take on the task of searching for, identifying, and displaying patterns or messages in the data. There are two basic kinds of data mining models: those that classify and those that predict.

CRISP-DM Conceptual Model (Data Mining for the Masses)

RapidMiner tool terminology

A new data mining project in RapidMiner
The RapidMiner start screen

Import Data Set

Import Data Set – Steps 5

Names of the attributes

Data types, role

Where to store

Data View

Meta Data View

For numeric data types, RapidMiner has given us the average (avg), or mean, for each attribute, as well as the standard deviation for each attribute. Standard deviations are measurements of how dispersed or varied the values in an attribute are, and so can be used to watch for inconsistent data. A good rule of thumb is that any value that is smaller than two standard deviations below the mean (or arithmetic average), or larger than two standard deviations above the mean, is a statistical outlier... It’s important to realize also that while two standard deviations is a guideline, it’s not a hard-and-fast rule…
One other step is needed in our data preparation. This is to change the data types of our selected attributes from integer to binominal. The association rules operators need this data type in order to function properly.
M. North, Data Mining for the Masses, 2012
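The two-standard-deviations rule of thumb quoted above can be sketched as follows. The `ages` values are made up for illustration, and the sketch assumes the sample standard deviation; whether RapidMiner reports the sample or the population figure is not stated in the slides.

```python
from statistics import mean, stdev

def two_sigma_outliers(values, k=2.0):
    """Return the values lying more than k standard deviations from the mean."""
    avg = mean(values)
    sd = stdev(values)  # sample standard deviation (an assumption)
    low, high = avg - k * sd, avg + k * sd
    return [v for v in values if v < low or v > high]

# Hypothetical attribute with one suspicious entry.
ages = [23, 25, 24, 26, 22, 25, 24, 23, 26, 95]
suspects = two_sigma_outliers(ages)
print(suspects)  # [95]
```

As the quotation stresses, a flagged value is only a candidate for inspection, not automatically an error.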

Toggle between Design Perspective and Results Perspective
Design Perspective

Data preparation: Numerical to Binominal

Drag the Numerical to Binominal operator into your stream.

Play

Meta Data View: data type transformation
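Outside RapidMiner, the effect of this data type transformation can be approximated with a short sketch. It assumes the common convention that 0 becomes false and any other value becomes true; the attribute names and the two example rows are hypothetical.

```python
def numerical_to_binominal(rows, attributes):
    """Map each selected numeric attribute to a two-valued (True/False) one.

    Assumption: 0 means "absent" (False), anything else means "present" (True).
    """
    return [
        {name: (value != 0 if name in attributes else value)
         for name, value in row.items()}
        for row in rows
    ]

# Hypothetical basket data with integer flags.
baskets = [
    {"basket_id": 1, "cookies": 1, "milk": 1},
    {"basket_id": 2, "cookies": 0, "milk": 1},
]
converted = numerical_to_binominal(baskets, {"cookies", "milk"})
print(converted)
```

The identifier attribute is left untouched because only the selected attributes take part in the association analysis.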

Confidence
If we examined ten shopping baskets and found that cookies were purchased in four of them, and milk was purchased in seven, and that further, in three of the four instances where cookies were purchased, milk was also in those baskets, we would have a 75% confidence in the association rule: cookies → milk. This is calculated by dividing the three instances where cookies and milk coincided by the four instances where they could have coincided (3/4 = 0.75, or 75%). The rule cookies → milk had a chance to occur four times, but it only occurred three, so our confidence in this rule is not absolute.
Now consider the reciprocal of the rule: milk → cookies. Milk was found in seven of our ten hypothetical baskets, while cookies were found in four. We know that the coincidence, or frequency of connection between these two products is three. So our confidence in milk → cookies falls to only 43% (3/7 = 0.429, or 43%). Milk had a chance to be found with cookies seven times, but it was only found with them three times, so our confidence in milk → cookies is a good bit lower than our confidence in cookies → milk. If a person comes to the store with the intention of buying cookies, we are more confident that they will also buy milk than if their intentions were reversed.
This concept is referred to in association rule mining as Premise → Conclusion. Premises are sometimes also referred to as antecedents, while conclusions are sometimes referred to as consequents.
M. North, Data Mining for the Masses, 2012

Support
We know that in our hypothetical example, cookies and milk were found together in three out of ten shopping baskets, so our support percentage for this association is 30% (3/10 = 0.3, or 30%).
M. North, Data Mining for the Masses, 2012
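The cookies and milk arithmetic can be reproduced with a small sketch. The ten baskets below are constructed to match the counts in the quotation (cookies in four, milk in seven, both in three); the `bread` item in the remaining baskets is an invented filler.

```python
def count(baskets, items):
    """Number of baskets that contain every item in `items`."""
    items = set(items)
    return sum(1 for basket in baskets if items <= basket)

def support(baskets, items):
    """Share of all baskets that contain the items together."""
    return count(baskets, items) / len(baskets)

def confidence(baskets, premise, conclusion):
    """Share of the premise baskets that also contain the conclusion."""
    both = set(premise) | set(conclusion)
    return count(baskets, both) / count(baskets, premise)

baskets = (
    [{"cookies", "milk"}] * 3   # cookies and milk together
    + [{"cookies"}] * 1         # cookies without milk
    + [{"milk"}] * 4            # milk without cookies
    + [{"bread"}] * 2           # neither (filler item is hypothetical)
)

print(f"{confidence(baskets, {'cookies'}, {'milk'}):.0%}")  # 75%
print(f"{confidence(baskets, {'milk'}, {'cookies'}):.0%}")  # 43%
print(f"{support(baskets, {'cookies', 'milk'}):.0%}")       # 30%
```

Confidence depends on the direction of the rule, while support is the same for both directions because it only counts co-occurrence.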

(Design Perspective) Frequency Pattern Analysis: FP-Growth

FP-Growth

FP-Growth - your DM stream: both your exa port and your fre port are connected to res ports

Parameters pane

Play

Further investigation

Create Association Rules operator

Play

Rules found with the 60% confidence threshold
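What the two operators do together can be approximated in plain Python. Note that this sketch does not implement FP-Growth: it counts every candidate itemset by brute force, which is only workable for tiny data. The baskets and the 25% support threshold are assumptions for illustration; the 60% confidence threshold follows the slide.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Brute-force search for itemsets whose support reaches min_support."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    result = {}
    for size in range(1, len(items) + 1):
        found_any = False
        for candidate in combinations(items, size):
            hits = sum(1 for t in transactions if set(candidate) <= t)
            if hits / n >= min_support:
                result[frozenset(candidate)] = hits / n
                found_any = True
        if not found_any:
            break  # no frequent itemset of this size, so none larger either
    return result

def association_rules(itemsets, min_confidence):
    """Derive premise -> conclusion rules that reach min_confidence."""
    rules = []
    for itemset, supp in itemsets.items():
        for size in range(1, len(itemset)):
            for premise in combinations(sorted(itemset), size):
                premise = frozenset(premise)
                # Every subset of a frequent itemset is itself frequent,
                # so its support is already in the dictionary.
                conf = supp / itemsets[premise]
                if conf >= min_confidence:
                    rules.append((set(premise), set(itemset - premise), conf))
    return rules

# Hypothetical baskets: cookies in 4, milk in 7, both in 3, out of 10.
transactions = ([{"cookies", "milk"}] * 3 + [{"cookies"}]
                + [{"milk"}] * 4 + [{"bread"}] * 2)
itemsets = frequent_itemsets(transactions, min_support=0.25)
rules = association_rules(itemsets, min_confidence=0.60)
for premise, conclusion, conf in rules:
    print(premise, "->", conclusion, f"{conf:.0%}")
```

With these thresholds only cookies → milk survives; the reverse rule falls below 60% confidence and is discarded.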

Confidence

Do linkages between attributes exist? Yes, they do.

Case Study (by Myrto Pirli)

Hostal Fernando, Barcelona

Plaza de San Jaime http://es.wikipedia.org/wiki/Plaza_de_San_Jaime

La Rambla, Barcelona http://en.wikipedia.org/wiki/La_Rambla,_Barcelona

Text Mining Procedure

Step 1: Establishing the corpus (2014)
- booking.com (949 total reviews)
- TripAdvisor (333 total reviews)
- Twitter (1 review)
The booking.com list was pruned to contain only reviews that contained text (i.e. reviews that just had a numeric grade and a headline were excluded). We also decided to use reviews from the past four years (2010-2014), as older reviews may not be relevant to the state of the hotel in the present day. This left 90 reviews from booking.com, 89 from TripAdvisor and 1 from Twitter. From these a sample of 45 reviews was obtained as follows:
1) 22 reviews from booking.com, by starting from the first review and selecting every fifth review
2) 22 reviews from TripAdvisor. TripAdvisor divides reviews into the categories Excellent, Very good, Average, Poor, and Terrible, so a stratified sample was obtained containing 5 Excellent reviews (out of 21), 13 Very good (out of 51), 3 Average (out of 11), 1 Poor (out of 4), and 0 Terrible (out of 2). Within each stratum the reviews were selected as in the booking.com sample (every n-th review)
3) 1 review from Twitter
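The TripAdvisor stratum sizes reported above are consistent with a proportional allocation of the 22 sampled reviews. The sketch below shows one way to arrive at them; the rounding rule and the step used for the every n-th selection are assumptions, since the slides do not state how they were chosen.

```python
def proportional_allocation(stratum_sizes, sample_size):
    """Give each stratum a share of the sample proportional to its size.

    Assumption: plain rounding. In general the rounded shares need not add
    up to sample_size; for the figures below they happen to.
    """
    total = sum(stratum_sizes.values())
    return {name: round(size * sample_size / total)
            for name, size in stratum_sizes.items()}

def every_nth(reviews, k):
    """Pick k reviews, starting at the first and stepping through the list."""
    if k == 0:
        return []
    step = len(reviews) // k  # assumed step; the source only says "every n-th"
    return reviews[::step][:k]

tripadvisor = {"Excellent": 21, "Very good": 51, "Average": 11,
               "Poor": 4, "Terrible": 2}
allocation = proportional_allocation(tripadvisor, 22)
print(allocation)

# Positions of the reviews that would be picked within the Excellent stratum.
picked = every_nth(list(range(21)), allocation["Excellent"])
print(picked)
```

Proportional allocation keeps the mix of ratings in the sample close to the mix in the full set of reviews, which is why the very small Terrible stratum ends up with no sampled review.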

Step 2: Creating the Term-Document Matrix
An Excel spreadsheet was created, with columns representing terms and rows representing each document (review). In places where a phrase was more important than a plain word, the phrase was used as a term (for example "good value for money" instead of "good", "value", and "money"). Synonymous phrases were stored as one term (for example "good location", "great location" and "liked location" became "liked location, good location"). 91 terms-phrases were recorded. Afterwards we did a further reduction of the data, by further replacing synonyms.

Replacement of synonymous terms
Initial terms -> Final (replacement) term
"excellent Wi-fi", "free Wi-fi", "Wi-fi" -> "Wi-fi"
"no Wi-fi", "disconnecting Wi-fi" -> "bad Wi-fi"
"shower pressure low", "shower needs replacing" -> "shower problems"
"substandard breakfast", "bad breakfast" -> "bad breakfast"
"church bells", "noisy street", "noisy common area", "water pipe noise" -> "noise"

Replacement of synonymous terms
This was done not only to reduce the number of terms, but also to increase the frequency of some rare terms across the documents so that they would be more meaningful (for example "shower pressure low" and "shower needs replacing" appeared in only 1 document each, but when replaced with the same synonym the term appears in 2 documents). This left a final 83 terms.
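The synonym replacement and the resulting term-document matrix can be sketched as follows. The mapping is taken from the replacement table; the three example reviews are invented, and the original work was done in an Excel spreadsheet rather than in code.

```python
# Mapping from initial terms to their replacement term (from the table).
SYNONYMS = {
    "excellent Wi-fi": "Wi-fi",
    "free Wi-fi": "Wi-fi",
    "no Wi-fi": "bad Wi-fi",
    "disconnecting Wi-fi": "bad Wi-fi",
    "shower pressure low": "shower problems",
    "shower needs replacing": "shower problems",
    "substandard breakfast": "bad breakfast",
    "church bells": "noise",
    "noisy street": "noise",
    "noisy common area": "noise",
    "water pipe noise": "noise",
}

def normalise(terms):
    """Replace every term by its synonym group, if it has one."""
    return {SYNONYMS.get(term, term) for term in terms}

def term_document_matrix(documents):
    """Rows are documents, columns are terms, cells are 1 (present) or 0."""
    docs = [normalise(terms) for terms in documents]
    vocabulary = sorted(set().union(*docs))
    matrix = [[1 if term in doc else 0 for term in vocabulary] for doc in docs]
    return vocabulary, matrix

def document_frequency(vocabulary, matrix):
    """Number of documents in which each term appears."""
    return {term: sum(row[i] for row in matrix)
            for i, term in enumerate(vocabulary)}

# Hypothetical reviews, already reduced to their terms.
reviews = [
    ["liked location", "shower pressure low", "free Wi-fi"],
    ["liked location", "shower needs replacing", "church bells"],
    ["noisy street", "Wi-fi"],
]
vocabulary, matrix = term_document_matrix(reviews)
frequencies = document_frequency(vocabulary, matrix)
print(frequencies)
```

In this toy input each of the two shower complaints occurs once, and after replacement "shower problems" is counted in two documents, which mirrors the effect described in the slide.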

Step 3: Extracting the knowledge
- The terms were sorted by frequency, i.e. how many documents they appeared in.
- A histogram was created with the 35 most common terms.
- Afterwards, the terms were divided into positive and negative terms and again sorted by frequency.
- Histograms were created for the top 30 positive and all (21) negative terms.
- RapidMiner can be used to find potential associations among the data.
- Various algorithms could be used for this.
- Examples of data sets for experimentation:
1) one containing all the data (documents and terms)
2) one containing all the documents but only the negative terms
3) one containing all the terms but only the documents that contained negative terms.
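The frequency ranking and the three experimental data sets listed above could be derived along these lines. The four documents and the set of negative terms are hypothetical stand-ins for the real term-document matrix.

```python
from collections import Counter

def ranked_terms(docs, n):
    """The n terms that appear in the most documents, with their counts."""
    counts = Counter(term for doc in docs for term in doc)
    return counts.most_common(n)

def only_negative_terms(docs, negative):
    """All documents, but each reduced to its negative terms."""
    return [doc & negative for doc in docs]

def only_docs_with_negative(docs, negative):
    """All terms, but only the documents that contain a negative term."""
    return [doc for doc in docs if doc & negative]

# Hypothetical documents, each represented as a set of terms.
docs = [
    {"liked location", "clean rooms"},
    {"liked location", "noise"},
    {"kind staff", "liked location", "bad breakfast"},
    {"clean rooms"},
]
negative = {"noise", "bad breakfast"}

data_set_1 = docs
data_set_2 = only_negative_terms(docs, negative)
data_set_3 = only_docs_with_negative(docs, negative)
print(ranked_terms(docs, 2))
```

Each document is a set, so a term is counted at most once per document, which matches the definition of frequency used in the slide.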

Sample - Example

How to use the tool RapidMiner to find potential associations among the data
The FP-Growth algorithm was used. Three data sets were used:
1) one containing all the data (documents and terms),
2) one containing all the documents but only the negative terms,
3) one containing all the terms but only the documents that contained negative terms.
The support threshold was 25% and the confidence threshold 60%. For each association, the proposed relationship was evaluated with both the lift and the Kulczynski measure. The Kulczynski measure is used to counter the effect of many null-transactions (Han, Kamber & Pei, 2011, p. 270).
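Both evaluation measures can be computed directly from counts. The sketch below uses the standard definitions, lift = P(A and B) / (P(A) P(B)) and Kulczynski = (P(B|A) + P(A|B)) / 2; the counts are the cookies and milk figures from the earlier slides, not results from the hotel data.

```python
def lift(n, n_a, n_b, n_ab):
    """Lift of A and B; values above 1 suggest a positive association."""
    return (n_ab / n) / ((n_a / n) * (n_b / n))

def kulczynski(n_a, n_b, n_ab):
    """Average of the two conditional probabilities P(B|A) and P(A|B)."""
    return 0.5 * (n_ab / n_a + n_ab / n_b)

# 10 baskets: cookies in 4, milk in 7, both in 3.
print(round(lift(10, 4, 7, 3), 4))      # 1.0714
print(round(kulczynski(4, 7, 3), 4))    # 0.5893

# Adding 990 baskets that contain neither item (null-transactions)
# inflates the lift but leaves the Kulczynski measure unchanged.
print(round(lift(1000, 4, 7, 3), 2))    # 107.14
```

The Kulczynski measure does not take the total number of transactions as an input at all, which is exactly why null-transactions cannot distort it.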

Method
In general, the associations found were not very strong. The results for the first set show a small positive association between "clean rooms" and "liked location". For the second set no associations were found. For the third set we again see the small positive association between "clean rooms" and "liked location". There also seems to be a slight positive association between "liked breakfast" and "liked location", and among "kind staff", "clean rooms" and "liked location".

Clustering
The Hopkins statistic was used to assess the clustering tendency of the data (Han, Kamber & Pei, 2011, pp. 484-486). The distance between the documents was calculated with RapidMiner's Data to Data Similarity model, and Excel was used to find the minimum distance. The value found for the Hopkins statistic was H = 70/(68+70) ≈ 0.5072. A value this close to 0.5 means that the data set is unlikely to have statistically significant clusters. Therefore we did not proceed further with clustering.
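A minimal sketch of the Hopkins statistic, following the form used in the slide (sum of data-to-data nearest-neighbour distances divided by the sum of both kinds of distances). It assumes Euclidean distance and uniform sampling inside the bounding box of the data; the case study itself worked with RapidMiner and Excel, and the random points below are purely illustrative.

```python
import math
import random

def nearest_distance(point, others):
    """Distance from a point to its nearest neighbour among `others`."""
    return min(math.dist(point, other) for other in others)

def hopkins(data, n_samples, rng):
    """Hopkins statistic H = sum(y) / (sum(x) + sum(y)).

    x_i: distance from a uniformly generated point to its nearest data point.
    y_i: distance from a sampled data point to its nearest *other* data point.
    H near 0.5 suggests no clustering tendency; H near 0 suggests clusters.
    """
    dims = len(data[0])
    lows = [min(p[d] for p in data) for d in range(dims)]
    highs = [max(p[d] for p in data) for d in range(dims)]
    x = []
    for _ in range(n_samples):
        probe = [rng.uniform(lo, hi) for lo, hi in zip(lows, highs)]
        x.append(nearest_distance(probe, data))
    y = []
    for i in rng.sample(range(len(data)), n_samples):
        others = data[:i] + data[i + 1:]
        y.append(nearest_distance(data[i], others))
    return sum(y) / (sum(x) + sum(y))

# The value reported in the slide, from the two sums of minimum distances.
reported = 70 / (68 + 70)
print(round(reported, 4))  # 0.5072

# Illustrative run on uniformly scattered 2-D points (hypothetical data).
rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(200)]
h = hopkins(points, n_samples=20, rng=rng)
print(h)
```

Because the statistic depends on random sampling, repeated runs give slightly different values; averaging over several runs is a common way to stabilise the estimate.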

Data to Data Similarity

Data to Data Similarity

Write Excel operator

Associations for complete data set

Associations for data set containing only negative reviews

Analysis
The most popular website for leaving reviews was booking.com. This is not unexpected, as many hotel customers will have booked through this website and naturally turn to it to leave feedback (booking.com also encourages customers by email to leave feedback). TripAdvisor was also very popular. Twitter, on the contrary, did not seem to be popular at all; most of the mentions of the hotel were simply someone stating that they were staying there, and only one was a review.

Most of the reviews were generally positive, something that was reflected in the term frequency. The hotel location was cited as the most positive thing, and this probably accounts for its presence in all the association patterns. The most common problems cited were the size and state of the rooms (badly lit, shabby decor), noise, quality of the breakfast, rude or grumpy staff, and bad Wi-fi. Regarding the last three, however, more people had the opposite experience (good breakfast, kind and helpful staff, and good Wi-fi). A possible explanation is that most members of the staff are helpful, but some are not, and the people reporting rude or grumpy staff had dealt with the latter.

We would recommend that the hotel offer the staff some training, to ensure that they all realise the importance of being friendly and helpful. This should also be offered to every newly hired staff member and anyone hired temporarily (e.g. only for the high season), and would minimize the chance of customers having a negative experience with a staff member.

Regarding the rooms, those that have not yet been renovated should be redecorated to fix the shabby decor. Windows should be double-glazed and the doors should be soundproofed, to reduce the noise in the rooms. There is probably not much that can be done about the lighting and the size, as these depend on the hotel's location. However, the owners might want to look into knocking down some walls between some of the smaller rooms to create one bigger room (for example two small twin rooms could be merged to create one triple room).

As far as breakfast is concerned, we recommend that the management ask for customer feedback and recommendations to pinpoint more precisely what some people think is wrong with the breakfast. Is it not varied enough? Is the quality of the food offered bad? They should leave feedback forms in the breakfast area and/or the rooms asking customers to rate different aspects of the breakfast and offer suggestions for its improvement. Then they must act on the feedback they get. The Wi-fi coverage should be checked to ensure all rooms have a good quality signal. If, as seems likely, some rooms are found to have poor signal, steps should be taken to correct this.

Business model

Conclusions

Text mining has offered an interesting insight into customers’ opinions about Hostal Fernando. Although for the most part the feedback has been positive, there have been some rather common complaints, the most frequent of which concerned the size of the rooms and noise, which shows there is room for improvement. The recommendations are that the hotel owners redecorate, refurbish, offer training to their staff, ensure Wi-fi signal quality in all rooms and solicit customer feedback in order to gain a better understanding of certain complaints. In this way customer satisfaction will improve and the hotel will remain competitive as a business in the future.

References
1. Han, J., Kamber, M. and Pei, J., 2011. Data Mining: Concepts and Techniques. 3rd ed. Saint Louis: Morgan Kaufmann.
2. Turban, E., Sharda, R., Delen, D., King, D. and Aronson, J. E., 2011. Business Intelligence: a managerial approach. 2nd ed. Upper Saddle River: Pearson Prentice Hall.
3. North, M., 2012. Data Mining for the Masses. ISBN: 978-0615684376.

Citation Note
Copyright Technological Educational Institute of Athens, Ch. Skourlas 2014. Ch. Skourlas. "Data Mining and Large-Scale Data Management. Unit 7: Topics in Data Mining and Business Intelligence". Version: 1.0. Athens 2014. Available from: ocp.teiath.gr.

Licensing Note
This material is made available under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 [1] licence or later, International Edition. Self-contained third-party works (e.g. photographs, diagrams, etc.) that are included in it are excluded; they are listed together with their terms of use in the "Note on Use of Third-Party Works".
[1] http://creativecommons.org/licenses/by-nc-sa/4.0/
Non-commercial use is defined as use that:
- does not involve direct or indirect financial benefit from the use of the work for the distributor of the work and the licensee
- does not involve a financial transaction as a condition for using or accessing the work
- does not bring the distributor of the work and the licensee indirect financial benefit (e.g. advertisements) from displaying the work on a website
The rights holder may grant the licensee a separate licence to use the work commercially, if requested.

Preservation of Notices
Any reproduction or adaptation of the material must include:
- the Citation Note
- the Licensing Note
- the Preservation of Notices statement
- the Note on Use of Third-Party Works (if any)
together with the accompanying hyperlinks.

Funding
This educational material has been developed as part of the instructor's teaching work. The project "Open Academic Courses at the University of Athens" has funded only the reformatting of the educational material. The project is implemented within the framework of the Operational Programme "Education and Lifelong Learning" and is co-financed by the European Union (European Social Fund) and by national funds.
