What I learned in Data Mining.

Information Extraction

Information Extraction (IE) and Information Retrieval (IR)

IE pulls facts and structured information from the content of large text collections (usually corpora).
IR pulls documents from large text collections (usually the Web) in response to specific keywords or queries.
With traditional query engines, getting the facts can be hard and slow.

IE would return information in a structured way
IR would return documents containing the relevant information somewhere (if you were lucky)
IE returns knowledge at a much deeper level than IR.
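
The contrast between IR and IE above can be sketched on a toy collection. This is only an illustration: the three sentences, the `retrieve` and `extract` helper names, and the single "was founded in" pattern are all made up for the example, and real IE systems use far richer rules or learned models.

```python
import re

# Toy collection; the sentences are invented for illustration.
DOCS = [
    "Acme Corp was founded in 1999 in Leeds.",
    "The weather in Leeds was mild this year.",
    "Globex Ltd was founded in 2005 in Cardiff.",
]

def retrieve(query, docs):
    """IR-style: return whole documents that mention the query term."""
    return [d for d in docs if query.lower() in d.lower()]

# Hypothetical pattern for one fact type:
# "<company> was founded in <year> in <city>."
FOUNDED = re.compile(
    r"^(?P<company>.+?) was founded in (?P<year>\d{4}) in (?P<city>\w+)\."
)

def extract(docs):
    """IE-style: return structured records instead of documents."""
    records = []
    for d in docs:
        m = FOUNDED.match(d)
        if m:
            records.append(
                {"company": m["company"], "year": int(m["year"]), "city": m["city"]}
            )
    return records

print(retrieve("Leeds", DOCS))  # two documents, only one of them about a company
print(extract(DOCS))            # two records with company, year and city fields
```

The IR call hands back documents that still have to be read, while the IE call returns the facts themselves in a fixed structure.
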
When would you use IE?
For access to news
For access to scientific reports

Named Entity Recognition

Identification of proper names in texts, and their classification into a set of predefined categories of interest
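
A minimal sketch of the idea, assuming a simple lookup list (a gazetteer) of known names. The names, categories and the `tag_entities` helper are hypothetical, and this is not how MUSE or GATE is implemented; it only shows "find proper names, then assign each a predefined category".

```python
import re

# Hypothetical gazetteer: known names mapped to predefined categories.
GAZETTEER = {
    "Sheffield": "LOCATION",
    "London": "LOCATION",
    "Ada Lovelace": "PERSON",
    "Reuters": "ORGANIZATION",
}

def tag_entities(text, gazetteer=GAZETTEER):
    """Return (name, category, start offset) for each gazetteer name in text."""
    found = []
    for name, category in gazetteer.items():
        # \b stops a name from matching inside a longer word.
        for m in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            found.append((name, category, m.start()))
    # Report entities in the order they appear in the text.
    return sorted(found, key=lambda item: item[2])

sentence = "Ada Lovelace told Reuters she was moving from London to Sheffield."
for name, category, start in tag_entities(sentence):
    print(start, name, category)
```

A pure lookup like this misses any name that is not in the list, which is one reason real systems combine gazetteers with context rules or machine learning.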

MUSE – MUlti-Source Entity Recognition

An IE system developed within GATE

Language and Computers

Document classification = sort documents into user-defined classes

Classification tasks include, for example, sentiment analysis.
One simple technique for identifying languages is to use n-grams (stretches of n tokens, i.e., letters or words).
Document classification is an example of a computer science activity called machine learning, which is itself a subfield of artificial intelligence.
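
The n-gram technique for language identification can be sketched as follows. The two training sentences, the choice of character trigrams, and the overlap score are assumptions made for the example; a real identifier would build its profiles from much more text.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Return all overlapping stretches of n characters in text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def profile(text, n=3):
    """Count how often each character n-gram occurs."""
    return Counter(char_ngrams(text, n))

def overlap(a, b):
    """Number of n-gram occurrences shared by two profiles."""
    return sum((a & b).values())

# Tiny made-up samples standing in for per-language training text.
SAMPLES = {
    "english": "the cat sat on the mat and the dog ran to the house",
    "german": "der hund und die katze sind in dem haus und der garten ist klein",
}
PROFILES = {lang: profile(text) for lang, text in SAMPLES.items()}

def guess_language(text):
    """Pick the language whose profile shares the most n-grams with text."""
    p = profile(text)
    return max(PROFILES, key=lambda lang: overlap(p, PROFILES[lang]))

print(char_ngrams("data", 2))  # ['da', 'at', 'ta']
print(guess_language("the dog and the cat"))
print(guess_language("die katze und der hund"))
```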

Supervised learning: training set and test set have been labeled with desired “correct answers”.
Unsupervised learning: assume there are no pre-specified categories.

First step in classifying or clustering documents:
identify properties most relevant to the decision we want to make, i.e., features
To make a useful system we need to tell the computer two things:

Exactly which features are used and exactly how to detect them. This is feature engineering, and it is hard to automate.

How to weight the evidence provided by the features. It often works well to use machine learning for this.

feature engineering:
Kitchen sink strategy
Hand-crafted strategy
The best features may be hard to collect reliably
Bag of words assumption
Imagine that we cut up a document and put the words in a bag: only the word counts are kept, the word order is lost.
Record the odds ratio for ham to spam as 12.5/0.8 (about 15.6 to 1).
Idea of Naive Bayes: count things that occur in the training set
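
A small sketch of how the bag of words and the Naive Bayes counting fit together. The four training messages are invented, the helper names are hypothetical, and add-one smoothing is an assumption added here so that unseen words do not give a zero probability; the notes above do not say which smoothing is used.

```python
import math
from collections import Counter

# Made-up labelled training set; real filters train on thousands of messages.
TRAIN = [
    ("spam", "win cash now"),
    ("spam", "cheap cash offer now"),
    ("ham", "meeting agenda for monday"),
    ("ham", "lunch on monday"),
]

def bag_of_words(text):
    """Cut the document into words and keep only the counts, not the order."""
    return Counter(text.lower().split())

def train(examples):
    """Count documents per class and words per class in the training set."""
    class_counts = Counter()
    word_counts = {}
    for label, text in examples:
        class_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(bag_of_words(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab

def log_score(text, label, model):
    """log P(label) + sum of log P(word | label), with add-one smoothing."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    total_words = sum(word_counts[label].values())
    score = math.log(class_counts[label] / total_docs)
    for word, n in bag_of_words(text).items():
        p = (word_counts[label][word] + 1) / (total_words + len(vocab))
        score += n * math.log(p)
    return score

def classify(text, model):
    """Return the class with the highest score."""
    return max(model[0], key=lambda label: log_score(text, label, model))

model = train(TRAIN)
print(bag_of_words("the cat saw the dog"))
print(classify("cash offer", model))
print(classify("monday meeting", model))
```

Working with log probabilities avoids multiplying many small numbers; the difference between the spam and ham scores is the log of the odds ratio between the two classes.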

authorship attribution

Stylometry
Lexical style markers
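
A few lexical style markers can be computed directly from word counts. The sketch below assumes three common ones (average word length, type-token ratio, and relative frequency of some function words); the function-word list and the `style_markers` name are illustrative choices, not taken from the notes.

```python
import re
from collections import Counter

# A short, illustrative list of function words; real studies use longer lists.
FUNCTION_WORDS = ("the", "of", "and", "to", "in", "a")

def style_markers(text):
    """Compute a few simple lexical style markers for one text."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        raise ValueError("text contains no words")
    counts = Counter(words)
    total = len(words)
    markers = {
        # Mean number of characters per word.
        "avg_word_length": sum(len(w) for w in words) / total,
        # Distinct words divided by all words (vocabulary richness).
        "type_token_ratio": len(counts) / total,
    }
    # Relative frequency of each function word.
    for fw in FUNCTION_WORDS:
        markers["freq_" + fw] = counts[fw] / total
    return markers

m = style_markers("The cat sat on the mat and the dog sat in the hall.")
for name, value in m.items():
    print(name, round(value, 3))
```

For authorship attribution, the marker values of a disputed text would be compared with those of texts by known authors.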

plagiarism

Data Warehousing

Introduction

We have lots of data; the point of data mining is to extract information and knowledge from it, and in particular to get the right knowledge at the right time. The idea of data warehousing is to make data available for data mining. You need a static snapshot of the data to do the evaluation.

Definition

a standardized data store
a process for bringing together disparate data from throughout an organization for decision-support purposes.

Run different algorithms on the same data and compare the results to find out which algorithm works better.
Store the data in a database rather than in a single file format.

subject-oriented: has a particular topic (e.g., discriminating similar languages)
integrated: harvested from different sources (databases, social media) into a single standard format
time-variant: Twitter, news (the data changes over time; this is not about predicting the future)
non-volatile: a static snapshot, so everyone uses the same dataset (the data is picked at a particular time)

a copy of transaction data, specifically structured for query and analysis

Data Mart

smaller, more focused data warehouse - a mini-warehouse
reflects the business rules of a specific business unit within an enterprise.

Generic Architecture of Data

transaction data: base level of data, the raw material for understanding customer behavior
operational summary data: summaries for a specific time period, built from the transaction data for that period.
decision support summary data: used to help make decisions about the business.
database schema: defines the structure of the data, not the values of the data.
metadata: data about data
business rules: highest level of abstraction from operational data.
OLAP (Online Analytical Processing): data representation for ease of visualization.
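
The step from transaction data to operational summary data can be sketched as a per-period aggregation. The records, the monthly period, and the `monthly_summary` helper are assumptions made for the example.

```python
from collections import defaultdict
from datetime import date

# Made-up transaction data: (date, customer, amount).
TRANSACTIONS = [
    (date(2024, 1, 5), "alice", 20.0),
    (date(2024, 1, 19), "bob", 35.5),
    (date(2024, 2, 2), "alice", 12.0),
    (date(2024, 2, 20), "alice", 8.0),
]

def monthly_summary(transactions):
    """Summarise transactions per (year, month): count and total amount."""
    summary = defaultdict(lambda: {"count": 0, "total": 0.0})
    for day, _customer, amount in transactions:
        key = (day.year, day.month)
        summary[key]["count"] += 1
        summary[key]["total"] += amount
    return dict(summary)

for period, stats in sorted(monthly_summary(TRANSACTIONS).items()):
    print(period, stats)
```

The summary rows use only the transactions of their own period, which matches the definition of operational summary data above; decision support summaries would aggregate further, for example across customers or regions.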

Is any dataset a data warehouse?
SIS: no
library catalogue: no
VLE: no
text in a textbook: yes

DEC-10 Lab

Data mining notes & weka notes

Information Extraction