CONTENTS

2017 Winter

CS 540 Database Management

class notes

02/04/2017 Sat

Data dependence / access path dependence: how the data is organized. Applications would hard-code access paths to the data, and so would rely on the continued existence of those access paths.

Levels of abstraction in a DBMS. Physical implementation: storage structure, indexing, access methods.

Operations on relations (deriving new relations):

  • Permutation: interchange the columns of an n-ary relation
  • Projection: select columns and remove any duplicates among the rows
  • Join: selectively combine tuples of two relations into a new "class" of relation
  • Composition
  • Restriction

What is missing from this set of operators? Order by and group by. Is the set minimal? How does it differ from today's relational algebra?

Redundancy: a relation is redundant if it can be derived from the others; the foundation is which operations are allowed in the derivation. Consistency: every data snapshot must satisfy some constraints.

the advantages of the relational model: simplicity, mathematical foundation

Does the relational model provide data independence? Ordering independence? Index independence? Access path independence? An index is a way to organize the data, like a binary tree; the index does not change the result of a query.

Parallel processing: relations in and out. Pipeline: pipe the output of one operator into the next. Partition: N operator clones, each processing 1/N of the input. Graphical user interfaces: relations fit the spreadsheet (table) metaphor; a table is easier than a graph to partition, unless it is a complex table.

1970: resistance even within IBM; too mathematical, no system. First implementation: 1973.

Relational queries can be time consuming, and network/hierarchical models are making a comeback: there is a great deal of graph data, and the web is a huge network database.

Back to subcontents CS 540

01/30/2017 Mon

B+ tree summary. Based on the material above, B-trees, B+ trees, and B* trees can be summarized roughly as follows:

  • B-tree: ordered array + balanced multiway tree; data also lives in internal (non-leaf) nodes.
  • B+ tree: ordered array with a linked leaf list + balanced multiway tree; data lives only in the leaves.
  • B* tree: a fuller (more densely packed) B+ tree.

Liang Bin, author of Inside Search Engines (走进搜索引擎), offers his view on B-trees versus B+ trees: "The biggest advantage of a B+ tree is that it makes scanning the whole database easy: a B-tree must be scanned in key order by an in-order traversal, while a B+ tree just walks the leaf nodes one after another. B+ trees support range queries very conveniently, and B-trees do not; this is the main reason databases choose B+ trees. For instance, to find keys between 5 and 10, a B+ tree seeks to 5, seeks to 10, and strings together the leaves in between; with a B-tree this is very awkward. The advantage of a B-tree is for successful lookups, because the tree is shallower overall than a B+ tree; for unsuccessful lookups a B-tree also has a slight edge. In your example, a lookup of 17 in the B-tree returns in one shot. Many frequency-based search systems choose B-trees, moving the most frequently queried nodes toward the root; this requires collecting query statistics and transforming the keys. Also, for B-trees and B+ trees alike, the root and the top few levels are queried repeatedly, so they are essentially always in memory and incur no disk I/O; typically they are swapped in proactively at startup."
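
To make the range-query point concrete, here is a minimal sketch of just the leaf level (the node layout is invented for illustration; a real B+ tree would first descend from the root to the leaf containing the lower bound):

# Minimal sketch of a B+ tree leaf chain (illustration only, not a full B+ tree).
class Leaf:
    def __init__(self, keys, nxt=None):
        self.keys = keys   # sorted keys in this leaf
        self.nxt = nxt     # pointer to the next leaf

def range_query(first_leaf, lo, hi):
    """Walk the leaves left to right, keeping keys in [lo, hi]."""
    out, leaf = [], first_leaf
    while leaf is not None:
        for k in leaf.keys:
            if k > hi:
                return out
            if k >= lo:
                out.append(k)
        leaf = leaf.nxt
    return out

# Example: leaves [1,3], [5,7], [9,11] chained together.
l3 = Leaf([9, 11]); l2 = Leaf([5, 7], l3); l1 = Leaf([1, 3], l2)
print(range_query(l1, 5, 10))   # [5, 7, 9]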

Back to subcontents CS 540

LRU Replacement

LRU stands for Least Recently Used. It is a page-replacement algorithm in memory management: data blocks (memory blocks) that are in memory but no longer being used are the "LRU" blocks, and the operating system evicts them to free space for loading other data. How to implement it is the real question: linked lists give an exact policy, but mostly approximation methods are used in practice.
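
An exact implementation keeps entries in access order using a hash map plus a doubly linked list; Python's OrderedDict bundles exactly that. A minimal sketch:

from collections import OrderedDict

class LRUCache:
    """Evicts the least recently used entry when capacity is exceeded."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)          # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)   # evict the least recently used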

Back to subcontents CS 540
Back to CONTENTS


paper review

01/12

review_Kaibo Liu_01.12_A Relational Model of Data for Large Shared Data Banks

What is the problem discussed in the paper?
This paper discusses data independence in the relational data model for shared access: changes in the internal or external representation of the data should have minimal or no impact on users at terminals or on the applications accessing the data. It also discusses the problem of data inconsistency.

Why is it important?
With growth in data types and changes in data representation, dependence and inconsistency become troublesome even in nondeductive systems. The advantage of the relational model is that it cleanly treats derivability, redundancy, and consistency of relations.

What are the main ideas of the proposed solution for the problem?
All the usual set operations are applicable to relations, though the result may not itself be a relation. Permutation, projection, join, composition, and restriction are operations specific to relations. The relational model is explained through its properties: each row is distinct, representing an n-tuple of an n-ary relation R; rows have no particular order; and each column has a distinct position and a well-defined label. Ordering of columns is needed when the order determines the relation, when domain names are identical, and when dealing with time-varying relations; but for relations of higher degree it is better to use unique domain names and treat relations as domain-unordered.

What are the final results and conclusion of the paper?
The paper defines operations on relations and two types of redundancy, and applies them to the problem of keeping the data in a consistent state. Many questions are raised and left unanswered; the paper had considerable impact for its time.

Question: Given its heavily mathematical presentation, how could the paper's model be turned into a more concrete system design?

Back to subcontents CS 540


01.12-2

review_Kaibo Liu_01.12_The Universal-Relation Data Model for Logical Independence

What is the problem discussed in the paper? This paper discusses access path independence, a problem not completely solved by the 1970 paper. A specific example: consider a database with two relations ED(Employee, Department) and DM(Department, Manager). If we are interested in the relationship between employees and managers through departments, we must specify the natural join of ED and DM and project it onto EM. The problem could be overcome by defining a view EM, but that approach may lead to an unwanted proliferation of views.
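
To illustrate with toy data (the relation contents are invented; the schema is the paper's ED/DM example), the user must spell out the join and projection by hand:

# Toy illustration of the ED/DM example: deriving EM requires an explicit
# join on Department followed by a projection onto (Employee, Manager).
ED = [("alice", "sales"), ("bob", "ops")]        # (Employee, Department)
DM = [("sales", "carol"), ("ops", "dave")]       # (Department, Manager)

EM = sorted({(e, m) for (e, d1) in ED for (d2, m) in DM if d1 == d2})
print(EM)   # [('alice', 'carol'), ('bob', 'dave')]

Under the universal-relation model, the user would simply ask for (Employee, Manager) and the system would find this connection itself.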

Why is it important? Complete access-path independence frees users from concern with the logical structure of the database, and protects them from the errors that creep into queries when complicated access paths must be specified.

What are the main ideas of the proposed solution for the problem? This paper proposes a new model, the universal-relation data model. The model is based on the assumption that there is a universal-relation scheme: a set of attributes about which queries may be posed. Further, each attribute in this set is assumed to play only one role, and puns are not allowed; an attribute like name cannot stand for names of employees, customers, suppliers, and managers in the same universal-relation scheme. Another basic assumption is that for every attribute set X there is a unique relationship on X that the user has in mind; this underlying assumption is called relationship uniqueness. Connection and query processing consist of two independent steps: binding and evaluation.

What are the final results and conclusion of the paper? The universal-relation model gives users a more succinct language for expressing queries, frees them from concern with the database's logical structure, and protects them from the errors that creep into queries when complicated access paths must be specified. The disadvantage is that having the system perform certain logical navigation automatically may place some subtle constraints on the database structure and may make unusual access paths harder to specify.

Question: What’s the efficiency of the approach compared to previous methods?

Back to subcontents CS 540


01.19

review_Kaibo Liu_01.19_Operating System Support for Database Management

This paper goes over several functions that operating systems of the time already provided and examines whether those services are appropriate for DBMS needs. The first is buffer pool management, which provides a main-memory cache for the file system; this works poorly for databases because the OS does not know which blocks to prefetch into memory, while the DBMS does. A DBMS buffer manager therefore has to be created in user space to circumvent the OS's version. The second part discusses the file system. The UNIX file system of that time treats a file as a character array, while a database provides an abstraction in which user keys map to records. Building a database on top of the OS file system is not always efficient, for two reasons: the file might be scattered over the disk, losing physical contiguity; and there end up being three tree structures (the file control block tree, the directory tree, and the tree a DBMS such as INGRES adds for keyed access via a multilevel directory structure). The authors note that the file hierarchy does nothing for a DBMS, and database developers must create their own indexes over flat character arrays. The paper then moves on to scheduling and process management, providing four ways to organize a multiuser database system. Because DBMSs manage their own locks to maintain consistency, they must also handle their own scheduling to avoid deadlocks and other issues. The performance and other costs of replicating these facilities leave quite a bit to be desired, but doing so is the best current option. Stonebraker's main point is how an operating system could be friendlier to database applications, and he argues that OS design should be more sensitive to database needs. I think this is an inevitable trade-off between generality and specificity.

As the author mentions at the end of the paper, there are so-called real-time OSs that provide minimal facilities close to what he suggests, and he hopes future OSs will provide both sets of services in one environment. This is a good idea, but I am a little curious whether we instead need to separate the conventional OS from a small, efficient OS that provides the desired services to the DBMS. It would be great to achieve both in one environment; if that is not possible, what is the disadvantage of developing the two OSs separately?

Back to subcontents CS 540


01.26

review_Kaibo Liu_01.26_Query Evaluation Techniques for Large Databases_Sections 5 and 7

Sections 5 and 7 of this paper cover: 5. Binary Matching Operations (5.1 Nested-Loops Join Algorithms, 5.2 Merge-Join Algorithms, 5.3 Hash Join Algorithms, 5.4 Pointer-Based Joins, 5.5 A Rough Performance Comparison) and 7. Duality of Sort- and Hash-Based Query Processing Algorithms.

In Section 5, the authors go on to describe nested loops, removing duplicates as early as possible, and hashing. I thought the second was a nice improvement, and easy to implement via sorting or hashing. From the graphs, hybrid hashing looks like the best way to do aggregation, though of course it is highly dependent on how good the hash function is. Joins are discussed next. There are many techniques, though the authors note that newer ones are sometimes unable to answer the question, "Does this new technique apply to joining three inputs without interrupting data flow between the join operations?" I would like to know more about this question: is it widely considered an important property of join techniques? What kinds of seriously examined techniques fail this test? Nested-loops join, merge-join, and hash-join algorithms are described; we talked about these in class. The heap-filter merge join sounds like a good way to save I/O operations over merge join while being more efficient than nested-loops join. Again, it was emphasized that skew in the hash function causes trouble for hash join.
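
For reference, the build-and-probe core of an in-memory equality hash join is small. A minimal sketch (toy tuples; no partitioning to disk as Grace/hybrid hash join would do):

from collections import defaultdict

def hash_join(R, S, r_key, s_key):
    """Build a hash table on one input, then probe it with the other."""
    table = defaultdict(list)
    for r in R:                             # build phase (ideally the smaller input)
        table[r[r_key]].append(r)
    for s in S:                             # probe phase
        for r in table.get(s[s_key], []):
            yield r + s

R = [(1, "a"), (2, "b")]
S = [(2, "x"), (2, "y"), (3, "z")]
print(list(hash_join(R, S, 0, 0)))          # [(2, 'b', 2, 'x'), (2, 'b', 2, 'y')]

A skewed join key sends many tuples to the same bucket, which is exactly the sensitivity to the hash function noted above.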

The paper reviews different operational aspects of a DBMS's query execution engine. The most prominent aspects of the sections we reviewed are the treatment of sorting and hashing algorithms. Sorting: the algorithm described is essentially quicksort for datasets that fit in memory and merge sort for larger ones. Sort-based aggregation outperforms nested loops, and aggregating as early as possible is useful. Merge-join, the sort-based join algorithm, is widely used since its rough cost can be estimated and it performs well enough; it can also be used to implement division between tables. Hashing: hybrid hashing writes only the necessary data to disk, saving I/O cost. Aggregation based on hybrid hashing performs better than the sort-based algorithm with early aggregation. Although computing hash values adds overhead, the search space becomes smaller, so hash join outperforms merge-join. Combined with a bitmap, the hash method is also efficient for the division operation. The paper briefly covers indexing as well, comparing various index structures by their support for ordering and sorted scans and for point and range data.

Back to subcontents CS 540


01.31

review_Kaibo Liu_01.31_Access Path Selection in a Relational Database Management System

https://blog.acolyer.org/2016/01/04/access-path-selection/ This paper introduces the idea of a query optimizer, built as part of System R, that plans the most efficient way to retrieve the data requested by a SQL query. As well as giving insight into how a query optimizer may be constructed, the paper also quietly introduced the important result that a declarative query language can be supported with no loss of performance relative to the procedural query language approaches common at the time. A declarative query language also relieves the programmer of the burden of choosing an access plan; the difference between good and bad access plans can be several orders of magnitude. Choosing a good plan raises three problems: what is the plan space? How do we estimate cost? How do we choose a plan? To find the optimal plan for join operations, a tree of possible solutions is constructed; there is a worked example in the paper, joining Employee, Job title, and Department tables, that makes this process clearer. The cost of a join is computed from the costs of scans on each relation (using the cost-estimate formulas we saw earlier) and the cardinalities.
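
To make the plan-space search concrete, here is a toy sketch of the dynamic-programming idea (the cardinalities and the nested-loops-flavored cost formula are made up for illustration; System R additionally tracks interesting orders, considers multiple access paths per relation, and prunes to left-deep trees):

from functools import reduce
from itertools import combinations

card = {"EMP": 10000, "JOB": 50, "DEPT": 200}   # made-up cardinalities

def out_card(subset):
    # toy assumption: result cardinality = product of base cardinalities
    return reduce(lambda a, r: a * card[r], subset, 1)

best = {frozenset([r]): (card[r], [r]) for r in card}   # cost of a plain scan
for size in range(2, len(card) + 1):
    for subset in map(frozenset, combinations(card, size)):
        for r in subset:
            rest = subset - {r}
            # nested-loops flavor: build `rest`, then pass over its result
            # once per tuple of r
            cost = best[rest][0] + out_card(rest) * card[r]
            if subset not in best or cost < best[subset][0]:
                best[subset] = (cost, best[rest][1] + [r])

# cheapest left-deep order under this toy model: JOB, then DEPT, then EMP
print(best[frozenset(card)])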

Question:

Back to subcontents CS 540


02.07

review_Kaibo Liu_02.07_Granularity of Locks and Degrees of Consistency in a Shared Data Base

This paper is divided into two sections: granularity of locks, and degrees of consistency. Each section answers questions on how lock choice in a database affects throughput and consistency.

In the granularity section, the choice of lockable units is discussed. A lockable unit represents a section of logical data that is atomically locked during a transaction. Locking smaller units such as individual records improves concurrency for "simple" transactions that access a small number of records. On the other hand, locking at the record level can reduce throughput for "complex" transactions that access many records: the overhead of acquiring and releasing locks overwhelms the computation. It follows that supporting different sizes of lockable units in the same system is required to handle multiple use cases. Two types of locks are presented: X (exclusive) locks and S (shared) locks.
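
The compatibility rule for these two modes is tiny; a minimal sketch (the paper's full treatment adds intention modes IS, IX, and SIX for hierarchical locking):

# S/X compatibility: shared locks coexist; an exclusive lock excludes everything.
COMPATIBLE = {("S", "S"): True, ("S", "X"): False,
              ("X", "S"): False, ("X", "X"): False}

def can_grant(requested, held_modes):
    return all(COMPATIBLE[(requested, h)] for h in held_modes)

print(can_grant("S", ["S", "S"]))   # True: readers share
print(can_grant("X", ["S"]))        # False: a writer must wait for readers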

The second section defines four consistency degrees, labelled 0 through 3.

  • A degree 0 transaction is the least restrictive type: it promises only not to overwrite data from other transactions, which requires the transaction to take a (short-lived) lock on any data it writes.
  • Degree 1 (Read Uncommitted) consistency keeps the promise of degree 0 (not to overwrite others' data) and also agrees not to commit any writes until the end of the transaction; this requires longer locks on written data, held until the end of the transaction.
  • Degree 2 (Read Committed) consistency further restricts a transaction to reading only committed values (by contrast, a degree 1 transaction may read dirty values). In addition to locks on all data written during the transaction, a degree 2 transaction acquires locks on all data it reads.
  • Degree 3 (Repeatable Read) consistency completely isolates transactions from one another: it acquires long-lived locks on both data read and data written. Degree 3 provides the highest level of isolation described in the paper.

Question:

Back to subcontents CS 540


02.09

review_Kaibo Liu_02.09_ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging (through Section 6.3)

New algorithms for database recovery and rollback are described. The paper assumes that the database uses write-ahead logging (WAL), and it describes in fine detail how the various activities during the update, rollback, and recovery phases should act so as to maximize concurrency and minimize both overhead and time. In the introduction, the authors also give an excellent description of the then-current state of the art in logging, failures, and recovery methods. The paper is broken into 12 sections and has an extensive bibliography (101 citations): an introduction; goals; an overview of ARIES; the major data structures; the actions that are part of normal processing (including transaction failure); restart processing (after system failure); the impact of checkpoints during restart; the methods needed for media recovery; nested top actions (independent actions kicked off by running transactions, such as file extension); recovery paradigms (mostly the problems they cause); properties of other WAL-based methods (including references to several commercial implementations); and a summary of the attributes of ARIES. As the proposed solution, the fundamental idea is to log changes to durable storage before applying them to the actual database; if this protocol is followed, then any failure can be recovered from using the log itself. This simple idea is used throughout the paper to derive the recovery algorithms for transaction failure, crashes, and storage media failure. As part of the ARIES protocol, each log record is assigned a log sequence number (LSN) uniquely identifying it, and each page in the database records the LSN of the record that most recently modified the page. ARIES also tracks in-flight transactions so it can restore the database to the exact point in time of the crash before completing recovery.
To perform recovery, ARIES uses the log and the transaction table in three passes: analysis, redo, and undo. During the analysis pass, the log is scanned to reconstruct the dirty pages and in-flight transactions as of the failure, determining the starting point where redo is required and the in-flight transactions that must be rolled back. During the redo pass, the database is restored to the state it was in at the time of failure (repeating history). Finally, during the undo pass, the in-flight transactions that failed have their changes rolled back. Together, the three passes restore the database to a consistent state.
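
As a deliberately tiny toy of the three passes (not ARIES itself: real ARIES uses a dirty page table, per-page LSN comparisons, compensation log records, and checkpoints; the log format here is invented):

# Toy three-pass recovery over an LSN-ordered log. Update records carry
# before- and after-images; a 'commit' record marks a winner transaction.
log = [(1, "T1", "update", "P1", None, "a"),
       (2, "T2", "update", "P2", None, "b"),
       (3, "T1", "commit", None, None, None),
       (4, "T2", "update", "P1", "a", "c")]   # T2 never commits

# Analysis pass: losers are transactions that never logged a commit.
updaters = {t for (_, t, op, _, _, _) in log if op == "update"}
winners  = {t for (_, t, op, _, _, _) in log if op == "commit"}
losers   = updaters - winners

# Redo pass: repeat history, reapplying every update (winners and losers).
pages = {}
for (_, t, op, page, before, after) in log:
    if op == "update":
        pages[page] = after

# Undo pass: roll back loser updates in reverse LSN order via before-images.
for (_, t, op, page, before, after) in reversed(log):
    if op == "update" and t in losers:
        pages[page] = before

print(losers, pages)   # {'T2'}; P1 back to 'a', P2 back to None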

Back to subcontents CS 540


02.28

review_Kaibo Liu_02.28_The Gamma Database Machine Project

  1. What is the problem discussed in the paper?

The motivation behind Gamma was to build a horizontally scalable database from commodity parts.

  2. What are the main ideas of the proposed solution for the problem?

The Gamma project uses a shared-nothing architecture to build a horizontally scalable database for the same reasons we see NoSQL vendors tackling the same problem today. Communication between nodes in the system is handled exclusively by passing messages between one another over the network.

Gamma is divided into several processes: the Catalog Manager handles database schema and metadata information; a Query Manager is associated with each active Gamma user and provides ad-hoc query support; and coordination of multi-node queries is handled by a Scheduler Process that drives the execution of individual Operator Processes on each node.

Gamma maintains a split table mapping tuples to the nodes on which they reside. Queries are then shipped to the nodes holding the relevant tuples for processing.
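
A minimal sketch of the split-table idea, with a made-up hash declustering and a made-up node count:

# Toy split table: hash-decluster tuples on an attribute across N nodes,
# then route a query for a given key to the single node that holds it.
NODES = 4

def node_for(key):
    return hash(key) % NODES

nodes = {n: [] for n in range(NODES)}
for emp, dept in [("alice", "sales"), ("bob", "ops"), ("carol", "sales")]:
    nodes[node_for(emp)].append((emp, dept))   # partitioned on Employee

# A selection on the partitioning attribute touches one node...
print(nodes[node_for("alice")])   # the node holding 'alice'
# ...while a selection on any other attribute must run on all NODES nodes.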

Selection operators are parallelized over all nodes by initiating a selection on the set of nodes where the required tuples are held. If the selection query is based on the attribute responsible for building the split table, then a subset of nodes will be engaged to process the query. Otherwise, the selection operator is run on every node in the cluster.

Join operators are executed using the hybrid hash-join algorithm: relations are first partitioned into N buckets, the first bucket is used to build an in-memory hash table, and the remaining buckets are stored in temporary files. The join runs on the in-memory hash table and, once complete, the remaining buckets are loaded into memory and joined in turn. In this way, a large join is broken up into a series of smaller joins.

Aggregate operators are executed by each node in the cluster, and the results combined into a final answer at a single node.

  3. What are the final results and conclusion of the paper?

Gamma uses two key ideas for enabling a scalable architecture. First, relations are partitioned across nodes in the cluster, allowing for parallel data scans without any specialized hardware. Second, Gamma popularized hash-based parallel algorithms for join and aggregate operators. The shared-nothing architecture continues to be the dominant form of scaling databases.

Question:
A more thorough explanation of how each relational operator is executed within the cluster: the paper provides a short summary of the techniques, but I would like to read a more detailed treatment.

Back to subcontents CS 540


02.28-2

review_Kaibo Liu_02.28_MapReduce: Simplified Data Processing on Large Clusters

  1. What is the problem discussed in the paper?
    The motivation of MapReduce is to provide an automatic way to parallelize execution over a large cluster of commodity machines. Once small commodity PCs became popular, it no longer made sense to run resource-hungry applications on a few supercomputers. But these commodity PCs individually have less capacity and weaker capability, so it is desirable to make the system fault tolerant and to make the large cluster easy to program.

  2. What are the main ideas of the proposed solution for the problem?
    The paper presents MapReduce, a simple yet powerful programming interface that enables automatic parallelization and distribution of large-scale computations, implemented on large clusters of commodity PCs to achieve high performance. All the magic is in providing just two user-defined functions, map and reduce; the MapReduce framework takes care of the rest (see the sketch after this list).

  3. What are the final results and conclusion of the paper?
    MapReduce is a general and easy-to-use programming paradigm for distributed computing, in particular for data-intensive applications. Developers only need to write their own map and reduce functions and they are done.
    MapReduce is a flexible framework. There are no dependencies between different map tasks or reduce tasks, so it is easy to re-run failed tasks or even run speculative tasks at the same time without incurring much overhead.
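
The canonical word-count example makes the interface concrete. Below is a local, single-process simulation (the grouping step stands in for the framework's partition/shuffle machinery, which in reality runs across machines with fault tolerance):

from collections import defaultdict

def map_fn(doc):                      # user-defined map: emit (word, 1)
    for word in doc.split():
        yield (word, 1)

def reduce_fn(word, counts):          # user-defined reduce: sum the counts
    return (word, sum(counts))

def mapreduce(docs):
    groups = defaultdict(list)        # stand-in for the shuffle/group step
    for doc in docs:
        for k, v in map_fn(doc):
            groups[k].append(v)
    return [reduce_fn(k, vs) for k, vs in groups.items()]

print(mapreduce(["the quick fox", "the lazy dog"]))
# [('the', 2), ('quick', 1), ('fox', 1), ('lazy', 1), ('dog', 1)]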

Definitely, MapReduce is the most popular distributed computing framework so far, but some applications may not fit its architecture. Compared to Dryad, which models a job as an arbitrary task graph, MapReduce is a special case: a bipartite graph consisting of just two stages of tasks, map and reduce. As the cluster scales up, the master can easily become the bottleneck and get overloaded. The master's functionality includes at least two parts: assigning cluster resources to tasks, and tracking the progress of tasks. For example, in the Apache Hadoop implementation, the workers (TaskTrackers in Hadoop) need to periodically exchange resource and task information with the master (JobTracker in Hadoop) via heartbeats, and the number of heartbeats the master can process per second is limited. So it is easy to imagine the workers getting delayed responses from the master in a large cluster.

Question:

Back to subcontents CS 540


03.02

review_Kaibo Liu_03.02_Scalable SQL and NoSQL Data Stores

This paper examines a number of SQL and so-called "NoSQL" data stores designed to scale simple OLTP-style application loads over many servers. Originally motivated by Web 2.0 applications, these systems are designed to scale to thousands or millions of users doing updates as well as reads, in contrast to traditional DBMSs and data warehouses. The paper contrasts the new systems on their data model, consistency mechanisms, storage mechanisms, durability guarantees, availability, query support, and other dimensions. These systems typically sacrifice some of these dimensions (e.g., database-wide transactional consistency) in order to achieve others (e.g., higher availability and scalability).

  1. NoSQL Motivation: originally motivated by Web 2.0 applications. The goal is to scale simple OLTP-style workloads to thousands or millions of users doing both updates and reads.

  2. What is the Problem? Scaling a relational DBMS is hard: even though queries can be scaled with parallel DBMSs, scaling transactions is much more difficult, because ACID properties are hard to guarantee beyond a single machine. There are six key features: 1) scale horizontally for "simple operations": key lookups, reads and writes of one record or a small number of records, simple selections; 2) replicate/distribute data over many servers; 3) a simple call-level interface (in contrast with SQL); 4) a weaker concurrency model than ACID; 5) efficient use of distributed indexes and RAM; 6) a flexible schema.

  3. Terminology
    • Sharding: horizontal partitioning by some key, storing records on different servers to improve performance
    • Horizontal scalability: distributing both data and load over many servers
    • Vertical scaling: a DBMS using multiple cores and/or CPUs
  4. Different Types of NoSQL (a taxonomy based on data models):
    • Key-value stores (e.g., Project Voldemort, Memcached)
    • Document stores (e.g., SimpleDB, CouchDB, MongoDB)
    • Extensible record stores (e.g., HBase, Cassandra, PNUTS)
  5. High-Level Differences between SQL and NoSQL:
    • SQL databases are relational databases (RDBMS), whereas NoSQL databases are non-relational or distributed databases.
    • SQL databases are table-based, whereas NoSQL databases are document stores, key-value stores, graph databases, or wide-column stores. SQL databases represent data as tables consisting of rows, whereas NoSQL databases are collections of key-value pairs, documents, graphs, or wide columns with no standard schema definition to adhere to.
    • SQL databases have a predefined schema, whereas NoSQL databases have a dynamic schema for unstructured data.
    • SQL databases are vertically scalable: they are scaled by increasing the horsepower of the hardware. NoSQL databases are horizontally scalable: they are scaled by adding database servers to the pool of resources to spread the load.
    • SQL databases use SQL (Structured Query Language) for defining and manipulating data, which is very powerful. In NoSQL databases, queries are focused on collections of documents; this is sometimes called UnQL (Unstructured Query Language), and its syntax varies from database to database.
    • SQL database examples: MySQL, Oracle, SQLite, Postgres, and MS SQL. NoSQL database examples: MongoDB, BigTable, Redis, RavenDB, Cassandra, HBase, Neo4j, and CouchDB. For complex queries: SQL databases are a good fit for query-intensive environments, while NoSQL databases are not; at a high level, NoSQL lacks standard interfaces for complex queries, and NoSQL query languages are not as powerful as SQL.
    • For the type of data stored: SQL databases are not the best fit for hierarchical data, whereas NoSQL fits hierarchical storage better, since it follows a key-value, JSON-like way of storing data. NoSQL databases are highly preferred for large data sets (i.e., big data); HBase is an example for this purpose.
    • For scalability: in most typical situations, SQL databases scale vertically; you manage increasing load by adding CPU, RAM, SSD, etc., on a single server. NoSQL databases scale horizontally; you just add a few more servers to the infrastructure to handle large traffic.

Question:

Back to subcontents CS 540


03.07

review_Kaibo Liu_03.07_Bigtable: A Distributed Storage System for Structured Data

What is the problem addressed?
Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers.

Why important?
Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving). Despite these varied demands, Bigtable has successfully provided a flexible, high-performance solution for all of these Google products. The paper describes the simple data model provided by Bigtable, which gives clients dynamic control over data layout and format, and describes the design and implementation of Bigtable.

Main technical contributions:
A Bigtable is a sparse, distributed, persistent multi-dimensional sorted map, indexed by a row key, a column key, and a timestamp. The Bigtable API provides functions for creating and deleting tables and column families, and for changing cluster, table, and column-family metadata such as access control rights. The key contribution may be the implementation decisions. The Bigtable implementation has three major components: a library that is linked into every client, one master server, and many tablet servers.
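
The data model can be sketched in a few lines; a toy in-memory stand-in for the sorted (row key, column key, timestamp) -> string map, using values in the style of the paper's webtable example:

# Toy sketch of Bigtable's data model: a map from
# (row key, column key, timestamp) to an uninterpreted string.
table = {
    ("com.cnn.www", "contents:", 3): "<html>v3</html>",
    ("com.cnn.www", "contents:", 2): "<html>v2</html>",
    ("com.cnn.www", "anchor:cnnsi.com", 9): "CNN",
}

def read(row, column):
    """Return the most recent version of a cell."""
    versions = [(ts, v) for (r, c, ts), v in table.items()
                if r == row and c == column]
    return max(versions)[1] if versions else None

print(read("com.cnn.www", "contents:"))   # '<html>v3</html>'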

Tablet servers can be dynamically added to (or removed from) a cluster to accommodate changes in workloads. The master is responsible for assigning tablets to tablet servers, detecting the addition and expiration of tablet servers, balancing tablet-server load, and garbage-collecting files in GFS. In addition, it handles schema changes such as table and column-family creation. Each tablet server manages a set of tablets: it handles read and write requests to the tablets it has loaded, and splits tablets that have grown too large. Tablet location information is stored in a three-level hierarchy analogous to a B+ tree: the first level is a file stored in Chubby that contains the location of the root tablet, which is never split, so the hierarchy stays at three levels.

Weaknesses or open questions:
I liked the GFS paper better, as it presents its implementation more clearly. It would be good if this paper posed the current challenges and showed how the authors overcame them; that would give the reader a clearer view of why the work needed to be done all over again.

Question:

Back to subcontents CS 540


03.07-2

review_Kaibo Liu_03.07_Dynamo: Amazon’s Highly Available Key-Value Store_section4.4+4.5

What is the problem addressed?
Design and implementation of a tunable distributed key-value store to suit the heterogeneous applications on Amazon's platform. Design considerations: the focus is on high availability and fault tolerance; eventual consistency is adopted.

Summary of the paper:
The system uses consistent hashing over virtual nodes (to support uniform load distribution) to partition the key space. Data is replicated among the N nodes of a preference list. Objects are versioned, and conflicts are resolved by the user; causality between different versions is captured using vector clocks. A sloppy quorum is used to write/read from the first W/R healthy nodes, where W + R > N. Hinted handoff ensures good distribution in the presence of transient failures.

Permanent failures are handled using Merkle trees. Anti-entropy, gossip-based schemes announce the addition and removal of nodes in the system. A cache is added to balance performance against durability. Three strategies are discussed for ensuring uniform load distribution in the presence of a few popular keys.
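
A minimal sketch of consistent hashing with virtual nodes and a preference list (the hash function, node names, and counts are arbitrary choices for illustration):

import bisect, hashlib

def h(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

# Ring of virtual nodes: each physical node owns several points on the ring.
ring = sorted((h("{}#{}".format(node, v)), node)
              for node in ["A", "B", "C"] for v in range(8))

def preference_list(key, n=2):
    """First n distinct nodes clockwise from the key's position."""
    i = bisect.bisect(ring, (h(key), ""))
    out = []
    while len(out) < n:
        node = ring[i % len(ring)][1]
        if node not in out:
            out.append(node)
        i += 1
    return out

print(preference_list("cart:123"))   # e.g. ['B', 'A']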

Main technical contributions:

  1. Integration of a slew of techniques (consistent hashing, replication, Merkle trees, anti-entropy algorithms, sloppy quorum, object versioning) in a production environment
  2. A system with tunable parameters R, W, N to adapt to the needs of heterogeneous applications
  3. A hands-on account of how to balance conflicting needs: performance vs. durability, background vs. foreground tasks
  4. A partition-aware client library that routes requests directly to the coordinator
  5. Techniques shown to scale to Amazon's environment

Relevance to distributed systems:

  • Having been deployed and run in a challenging and varied application environment such as Amazon's, it serves as a model for an eventually consistent data store
  • The ability to tune parameters suits a variety of applications
  • Techniques to tolerate transient and permanent faults are implemented
  • It addresses scalability challenges

Question:

Back to subcontents CS 540


03.09

review_Kaibo Liu_03.09_The PageRank Citation Ranking: Bringing Order to the Web

  1. what problem does page rank solve?
    The World Wide Web creates many new challenges for information retrieval: it is very large and heterogeneous, so it is difficult to surface the most relevant pages at the top of a search engine's results. PageRank is a method for rating the importance of web pages using the link structure of the web. It finds application in estimating web traffic, backlink prediction, user navigation, and many information retrieval tasks.

  2. Technicalities:
    2.1 Algorithm: link structure of the web: every page has some number of forward links (out-edges) and back links (in-edges). Forward links are known once the network graph is downloaded, but back links cannot be enumerated directly. PageRank states that a page is important if important pages link to it; the algorithm assigns each page a numerical weight representing its importance. The weights form a probability distribution over web pages, so all the PageRanks sum to one. The PageRank of a page u depends on the PageRank of each page v in the set of pages linking to u, divided by the number of links out of v. The theory pictures an imaginary surfer who randomly clicks links and eventually stops clicking: at each step, the probability that the surfer continues is the damping factor, and a vector of pages that the surfer may jump to is added to the equation. The damping factor is subtracted from 1, and this term is added to the product of the damping factor and the sum of the incoming PageRank contributions.
    The score is recalculated each time the crawler crawls the web and rebuilds the index.
    If a page has no links to other pages, it becomes a sink and would terminate the random surfing process; if the random surfer arrives at a sink page, it picks another URL at random and continues surfing. The PageRank values are the entries of the dominant eigenvector of the modified adjacency matrix, which makes PageRank a particularly elegant metric.
    The convergence argument assumes that the web is an expander-like graph: a random walk is rapidly mixing if it quickly converges to a limiting distribution on the set of nodes, and a graph has a good expansion factor if and only if its largest eigenvalue is sufficiently larger than the second-largest eigenvalue.
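
    In symbols, one common form of the update just described (the paper itself writes it with a normalizing constant c and a jump vector E over pages): with damping factor $d$, $B_u$ the set of pages linking to $u$, and $N_v$ the number of forward links out of $v$,

    $$PR(u) = (1 - d) + d \sum_{v \in B_u} \frac{PR(v)}{N_v}$$

    Iterating this equation to a fixed point is power iteration on the modified adjacency matrix, which is why the PageRank vector is its dominant eigenvector.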

2.2 Issue of Dangling Links: links that point to pages with no outgoing links affect the model, since it is not clear where their weight should be distributed. The solution is to remove them before the PageRank calculation and add them back afterwards. The algorithm sorts the link structure by ID, makes an initial assignment of ranks, and starts iterating.

2.3 Experimental Evaluation: the authors' experiments show that the distribution of web pages the random surfer periodically jumps to is an important component of the PageRank algorithm. Making this parameter uniform over all web pages yields many related links with high ranks, while a personalized PageRank may consist of a single page.

  3. Weakness: new pages may be assigned low PageRank and take a long time to get listed and gain high ranks. Search results are based on literal matching rather than meaning. It might also be easy to manipulate the results if an organization puts time into increasing its website's ranking.

Question:

Back to subcontents CS 540


03.09-2

review_Kaibo Liu_03.09_Efficient IR-Style Keyword Search over Relational Databases

Question:

https://web.eecs.umich.edu/~mozafari/fall2015/eecs584/reviews/summaries/

Back to subcontents CS 540
Back to CONTENTS


Assignment 4

gantt
dateFormat DD
section T1
R(X): 01, 2d
R(Y): 09, 2d
section T2
R(Y): 03, 2d
R(X): 07, 2d
section T3
W(X): 05, 2d
gantt
dateFormat DD
section T1
R(X): 01, 1d
R(Y): 02, 1d
W(X): 03, 1d
W(X): 06, 1d
section T2
R(Y): 04, 1d
R(Y): 07, 1d
section T3
W(Y): 05, 1d
gantt
dateFormat DD
section T1
W(X): 01, 1d
W(X): 03, 1d
Commit: 05, 1d
section T2
R(X): 02, 1d
Commit: 04, 1d
gantt
dateFormat DD
section T1
R(X): 01, 1d
W(X): 03, 1d
Commit: 05, 1d
section T2
W(X): 02, 1d
Commit: 06, 1d
section T3
R(X): 04, 1d
Commit: 07, 1d
graph TD
DB-->R1
DB-->R2
R1-->t1
R1-->t2
R2-->t3
R2-->t4
R2-->t5

Back to subcontents CS 540
Back to CONTENTS


Assignment 5

02/25/2017 Sat

Setting Up Hadoop Environment

  1. log in to server ssh liukaib@hadoop-master.engr.oregonstate.edu.
  2. add the following to ~/.cshrc
    setenv JAVA_HOME "/usr/lib/jvm/java-1.8.0"
    setenv PATH "$JAVA_HOME/bin:$PATH"
    setenv CLASSPATH ".:$JAVA_HOME/jre/lib:$JAVA_HOME/lib:$JAVA_HOME/lib/tools.jar"
    setenv HADOOP_HOME /opt/hadoop/hadoop
    setenv HADOOP_COMMON_HOME $HADOOP_HOME
    setenv HADOOP_HDFS_HOME $HADOOP_HOME
    setenv HADOOP_MAPRED_HOME $HADOOP_HOME
    setenv HADOOP_YARN_HOME $HADOOP_HOME
    setenv HADOOP_OPTS "-Djava.library.path=$HADOOP_HOME/lib/native"
    setenv HADOOP_COMMON_LIB_NATIVE_DIR $HADOOP_HOME/lib/native
    setenv PATH $HADOOP_HOME/sbin:$HADOOP_HOME/bin:$PATH
    setenv HADOOP_CLASSPATH "${JAVA_HOME}/lib/tools.jar"
    

don't forget to run source ~/.cshrc

    But the login shell on the server is bash (use echo $0 to display the current shell). When I ran source ~/.cshrc there was a syntax error with set path, even after switching shells with exec /bin/csh, and I didn't know how to deal with it. I then put all the variables into ~/.bashrc, but setenv is not supported in bash, so I couldn't source it there either. The final solution is to translate each setenv into bash's export var=val: in ~/.bashrc (or the first of ~/.bash_profile, ~/.bash_login, and ~/.profile that exists), source this script (saved as something like ~/sourcecsh) using . ~/sourcecsh:

    #!/bin/bash
    # This should be sourced rather than executed
    while read cmd var val
    do
     if [[ $cmd == "setenv" ]]
     then
         eval "export $var=$val"
     fi
    done < ~/.cshrc
    
  3. unzip the assignment package, downloaded to the home directory with wget [url]
  4. manipulate files on the Hadoop server with the special commands below

Back to subcontents CS 540

02/25/2017 Sat

Hadoop implementation

Working with HDFS: you have a folder on the HDFS server at /user/cs540/. Note that files on HDFS can be manipulated only with the special commands given below. Throughout the assignment you should use this directory whenever you need to work with HDFS; you can upload files or write the output of your jobs to it. Note that this directory is not backed up. The following commands let you interact with HDFS. Here, <HDFS path> is `/user/cs540/liukaib`:

  • View list of files and folders:
    hdfs dfs -ls <HDFS path>
  • Upload a file to HDFS:
    hdfs dfs -put <file on engr account> <HDFS path>
  • Download a file from HDFS:
    hdfs dfs -get <HDFS path>/<file_name>
  • View file on HDFS:
    hdfs dfs -cat <HDFS path>/<file_name>
  • Make a new directory:
    hdfs dfs -mkdir <HDFS path>/<folder_name>
  • Remove a file:
    hdfs dfs -rm <HDFS path>/<file_name>
  • Remove a directory:
    hdfs dfs -rm -r <HDFS path>/<folder_name>

Compiling and Running

Once you have finished your implementation you can run the following commands to compile your code and create a jar file:

hadoop com.sun.tools.javac.Main PageCount.java
jar cf pc.jar PageCount*.class

Then upload the input file input.csv to your HDFS folder:
hdfs dfs -put input.csv <HDFS path>/

Next you can run the application:

hadoop jar pc.jar PageCount <HDFS path>/input.csv <HDFS path>/output

Note that the output directory must not exist before you run the above command; otherwise you will get an error. After the job finishes you can view the output file:
hdfs dfs -cat <HDFS path>/output/part-r-00000

Depending on the number of reducers you may get more than one output file. Use the list command from the previous section to go through the different output files.

  • You can put your code in a directory under your engineering account’s home folder. This way, after logging in to the Hadoop server, you can access your code in your engineering home folder and you can edit and compile the code there. However, the input file should be uploaded to the HDFS using the given commands.
  • After running your job you can view its status through a web interface. To do this, log in to the Hadoop server; if you are a Mac user, you need to add -X to the ssh command to enable X11 forwarding. Once logged in, type firefox in the terminal; this should open a Firefox window. Then go to http://hadoop-master.engr.oregonstate.edu:8088/ in Firefox.
    You should be able to see information on your hadoop jobs.

Back to subcontents CS 540
Back to CONTENTS


CS 540 Project

Architecture

sequenceDiagram
    opt input
        Note left of User: user has something
        User->>System: Keywords
        Note right of System: SELECT * FROM DB WHERE keywords
        System-->>System: Predict in network
        System-->>User: Return (top k entries)
        User->>System: Choose/Click from k entries
        System-->>System: Train with new data
    end

    opt no input
        Note left of User: user has no clue
        User->>System: Void keyword
        Note right of System: project database
        System-->>System: Predict in network
        System-->>User: Return (top k entries)
    end


Back to subcontents CS 540
Back to CONTENTS


CS 519 Deep Learning

02/04/2017 Sat

Assignment 2 of CS519

The dict loaded by cPickle.load(open("cifar_2class_py2.p","rb")) has four members: dict["train_data"], dict["train_labels"], dict["test_data"], and dict["test_labels"]. Checking with print type(variable_name) shows that each member is a 2-D numpy array: in the _data arrays each row is one image and the row count is the number of samples, while the _labels arrays hold the corresponding labels (airplane = 0 / ship = 1). The images can be viewed after a reshape.
Running the code showed that matplotlib was missing, so I installed it.

Back to subcontents CS 519

02/05/2017 Sun

Assignment 2 of CS519

Countless rounds of trial and error to display the CIFAR-10 images. I eventually learned that within each row of data (one image), the first 1024 of the 3072 values are the red channel of the 32×32 image, the next 1024 are the green channel, and the last 1024 are the blue channel.

So when reshaping, first split into 3 parts (channels), then 32 (rows, h) and 32 (columns, w), and then transpose with .T or .transpose(1,2,0) so that the 3-way channel split moves to the last axis, giving shape (32, 32, 3). After this transformation it is easy to access the image and the value of any pixel in it.

For display, use imshow from plt (import matplotlib.pyplot as plt), but the image looked blurry and blocky. The resolution is inherently low, yet it still differed from the CIFAR sample images online. It turns out plt.imshow has an interpolation parameter: 'nearest' and 'none' show the raw pixels without interpolation, while 'spline36', 'sinc', and the like smooth the image with different interpolation algorithms. I picked a decent-looking one for the visualization.
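
Putting the steps together, a short sketch (assuming d is the dict loaded by cPickle above; index 0 is an arbitrary sample):

import matplotlib.pyplot as plt

row = d["train_data"][0]                          # one image: 3072 values, R|G|B
img = row.reshape(3, 32, 32).transpose(1, 2, 0)   # channels last: (32, 32, 3)
plt.imshow(img.astype('uint8'), interpolation='spline36')
plt.show()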

Back to subcontents CS 519

02/06/2017 Mon

Assignment 2 of CS519

Batch Training: The gradients calculated at each training example are added together to determine the change in the weights and biases. For a discussion of batch training with the backpropagation algorithm see page 12-7 of [HDB96].

A compromise between computing the true gradient and the gradient at a single example is to compute the gradient against more than one training example (a "mini-batch") at each step. This can perform significantly better than true stochastic gradient descent because the code can use vectorization libraries rather than computing each step separately. It may also result in smoother convergence, as the gradient computed at each step uses more training examples.

With mini-batch size $b = 10$: take examples $x^{(i)}, y^{(i)}, \ldots, x^{(i+b-1)}, y^{(i+b-1)}$ and update each weight as $w_j \leftarrow w_j - \frac{\alpha}{b} \sum_{k=i}^{i+b-1} \Delta w_j^{(k)}$.

Summing the gradients from individual samples gives a much smoother gradient; the larger the batch, the smoother the resulting gradient used in updating the weights.

Dividing the sum by the batch size and taking the average gradient has the following effects:

  1. The magnitude of the weights does not grow out of proportion. Adding L2 regularization to the weight update penalizes large weight values, which often improves generalization. Taking the average, especially when the gradients point in the same direction, keeps the weights from growing too large.
  2. The magnitude of the gradient is independent of the batch size, which allows comparison with weights from other experiments using different batch sizes.
  3. Countering the effect of the batch size with the learning rate instead can be numerically equivalent, but you end up with a learning rate that is implementation-specific. That makes it difficult to communicate your results and experimental setup: if people cannot relate to the scale of the parameters you are using, they will have trouble reproducing your experiment.

Averaging enables clearer comparability and keeps gradient magnitudes independent of batch size. The choice of batch size is sometimes constrained by the computational resources you have, and you want to mitigate that effect when evaluating your model.
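
A numpy sketch of the averaged mini-batch update described above, on a toy least-squares problem (the names and the linear model are illustrative, not the assignment's network):

import numpy as np

def minibatch_sgd(X, y, lr=0.1, b=10, epochs=5):
    """Averaged mini-batch SGD for least squares: w -= lr * mean gradient."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in range(0, len(X), b):
            Xb, yb = X[i:i+b], y[i:i+b]
            grad = Xb.T @ (Xb @ w - yb) / len(Xb)   # average over the batch
            w -= lr * grad
    return w

X = np.random.randn(100, 3)
y = X @ np.array([1.0, -2.0, 0.5])
print(minibatch_sgd(X, y))   # close to [ 1. -2.  0.5]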

Neural Networks: weight change momentum and weight decay

  • one epoch = one forward pass and one backward pass of all the training examples
  • batch size = the number of training examples in one forward/backward pass. The higher the batch size, the more memory space you’ll need.
  • number of iterations = number of passes, each pass using [batch size] number of examples. To be clear, one pass = one forward pass + one backward pass (we do not count the forward pass and backward pass as two different passes).

1 epoch = number of iterations × batch size (counted in training examples)

Back to subcontents CS 519

02/10/2017 Fri

Assignment 2 of CS519 Architecture

graph LR
id1((x:d*batch))-->id2((z1:m*batch))
id2((z1:m*batch))-->id3((z2:m*batch))
id3((z2:m*batch))-->id4((z3:1*batch))
id4((z3:1*batch))-->id5((p,e:1*batch))
graph LR
id1(W:d*m)-->id0((hidden layer))
id2(b:m*1)-->id0((hidden layer))
id0((hidden layer))-->id3(w:m*1)
id0((hidden layer))-->id4(c:1*1)

Back to subcontents CS 519

Back to CONTENTS


Assignment3

02/25/2017 Sat

vim ~/.theanorc, add:

[global]
floatX = float32
device = gpu0

Delete the device line if you want to use the CPU only. (Note: JSON has no comment symbol, so entries must be deleted rather than commented out.)

vim ~/.bashrc, add CUDA path:

export PATH=$PATH:/usr/local/eecsapps/cuda/cuda-7.5.18/bin
export LD_LIBRARY_PATH=/usr/local/eecsapps/cuda/cuda-7.5.18/lib64

The above paths are for pelican.eecs.oregonstate.edu, where the CUDA root is /usr/local/eecsapps/cuda/cuda-7.5.18.
For the TitanX server, the CUDA root is /usr/local/cuda-8.0/. Editing ~/.keras/keras.json switches the backend; take out the "image_dim_ordering": "tf" entry.

On the TitanX server there were encoding issues loading the data, with both python2 and python3. I added the CUDA path (found with which nvcc) to .bashrc, after which the default python2 could use the GPU in keras as well (this seems irrelevant, though, since python2 could use the GPU without this path too).

  • Use theano as backend, display Using gpu device 0: TITAN X (Pascal) (CNMeM is disabled, cuDNN 5110), error is ValueError: negative dimensions are not allowed(line 61, model.add(Dense(512)))
  • Use tensorflow as backend, display opened CUDA library libcublas.so.8.0 locally, error is ValueError: Negative dimension size caused by subtracting 3 from 1 for 'Conv2D_1' (op: 'Conv2D') with input shapes: [?,1,16,32], [3,3,32,32]. (line 48, model.add(Convolution2D(32, 3, 3)))

However, on the pelican server everything seems correct, except that the runtime for an epoch is ~500s: neither >2000s nor <50s. Weird…

FUCK!! I didn’t use source after modifying ~/.bashrc

After correct configuration in pelican, runtime for an epoch is 15s (12~13s if no processes occupied by other users).

Back to subcontents CS 519

02/27/2017 Mon

With previous error in TitanX, Zheng and I noticed the warning when running the .py with theano backend:
UserWarning: Your cuDNN version is more recent than the one Theano officially supports. If you see any problems, try updating Theano or downgrading cuDNN to version 5.

So I tried disabling cuDNN, giving up the acceleration:

THEANO_FLAGS=optimizer_excluding=cudnn python cifar10_cnn.py

No improvement…

Following the tensorflow backend's error information (ValueError: Negative dimension size caused by subtracting 3 from 1 for 'Conv2D_1' (op: 'Conv2D') with input shapes: [?,1,16,32], [3,3,32,32]), I tried displaying the dimensions of each layer on both the TitanX and pelican servers by editing the following into the py code:

model.add(AveragePooling2D(pool_size=(2,2), input_shape=(img_channels, img_rows, img_cols)))
print(model.layers[-1].output_shape)
model.add(Convolution2D(32, 3, 3, border_mode='same'))
print(model.layers[-1].output_shape)

[Out]:

| Layer | TitanX server | Pelican server |
|:-:|:-:|:-:|
| keras.__version__ | 1.2.2 | 0.3.2 |
| model.add(AveragePooling2D()) | (None, 1, 16, 32) | (None, 3, 16, 16) |
| model.add(Convolution2D()) | (None, 1, 16, 32) | (None, 32, 16, 16) |
| MaxPooling2D input shape ('th' order) | (samples, channels, rows, cols) | (samples, channels, rows, cols) |
| MaxPooling2D input shape ('tf' order) | (samples, rows, cols, channels) | (samples, channels, rows, cols) |

So the reason is: the keras on the TitanX server is too recent (version 1.2.2) for the assignment's code, which is only compatible with older versions (like 0.3.2 on pelican), and the output dimension of the AveragePooling2D function differs between the two versions.

  • version check: in python, import keras and then keras.__version__

    Solution:

    install an older version of keras

    pip uninstall keras
    pip install keras==0.3.2
    

    everything goes well:

| GPU | Titan X | ~ (Titan X) | GTX 980 Ti |
|:-:|:-:|:-:|:-:|
| data augmentation | not using | using real-time | not using |
| time/epoch | 7-8s | 10-11s | 12-13s |

Back to subcontents CS 519

Another find, with the help of Lawrance:

03/02/2017 Thu

MaxPooling2D input shape:
a 4D tensor with shape (samples, channels, rows, cols) if dim_ordering='th',
or
a 4D tensor with shape (samples, rows, cols, channels) if dim_ordering='tf'.

So I believe keras changed the order of the arguments in the 4D shape from (samples, channels, rows, cols) in v0.3.2 to (samples, rows, cols, channels) in v1.2.2 when dim_ordering='tf' (in the older version, 'tf' was compatible with the 'th' order).

So, for keras version 1.2.2, I need to set dim_ordering according to the backend.

02/28/2017 Tue

In the assignment, we need to save and load a model.

However, the version on pelican, the only one that works with Fuxin's code, is 0.3.2, and this version doesn't support the newer functions from version 1.2.2:

from keras.models import load_model
model.save('my_model.h5')  # creates a HDF5 file 'my_model.h5'
### model.save() to save a Keras model into a single HDF5 file which will contain:
### - the architecture of the model, allowing to re-create the model
### - the weights of the model
### - the training configuration (loss, optimizer)
### - the state of the optimizer, allowing to resume training exactly where you left off.
del model  # deletes the existing model
model = load_model('my_model.h5')

So I cannot use load_model for now. I found some methods to save/load the architecture of the model (JSON) and the weights (HDF5) separately.

# serialize model to JSON
json_model = model.to_json()
with open(Dir+'model.json', "w") as json_file:
#    json.dump(json_model,json_file)
    json_file.write(json_model)
# serialize weights to HDF5
model.save_weights(Dir+'model.h5')  # creates a HDF5 file for weights 'model.h5'
print("Saved model to disk")

# later...

from keras.models import model_from_json
json_file = open(Dir+'model.json', 'r')
json_model = json_file.read()
json_file.close()
model = model_from_json(json_model)

# load weights into new model
model.load_weights(Dir+'model.h5')  # loads the HDF5 weights file 'model.h5'
print("Loaded model from disk")

When writing the JSON into the file, if I use json.dump(json_model, json_file) instead of json_file.write(json_model), saving works, but later when loading the JSON file there is an error AttributeError: 'unicode' object has no attribute 'get' in model_from_json(json_model): json.dump serializes the already-serialized string a second time, so it decodes back to a unicode string instead of a dict.

There is another point of confusion about loading a model, from this code found online:

def train():
   model = create_model()
   sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)
   model.compile(loss='binary_crossentropy', optimizer=sgd)

   checkpointer = ModelCheckpoint(filepath="/tmp/weights.hdf5", verbose=1, save_best_only=True)
   model.fit(X_train, y_train, nb_epoch=20, batch_size=16, show_accuracy=True, validation_split=0.2, verbose=2, callbacks=[checkpointer])

def load_trained_model(weights_path):
   model = create_model()
   model.load_weights(weights_path)

2. Add/remove/insert a layer

After loading the model and weights into a new model, the question is whether to compile first or to add/remove/insert the layer first.

In the experiments for question 2), I found that calling model.compile (with SGD) after inserting gives an error: ValueError: GpuElemwise. Input dimension mis-match. Input 3 (indices start at 0) has shape[1] == 10, but the output's size on that axis is 512. So I call model.compile (with SGD) first, then insert the layer (pop, add, add):

sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='categorical_crossentropy', optimizer=sgd)
layer1 = model.layers.pop()
model.add(Dense(512))
model.add(layer1)
--------------------------------------------------------------------------------
Initial input shape: (None, 3, 32, 32)
--------------------------------------------------------------------------------
Layer (name)                  Output Shape                  Param #
--------------------------------------------------------------------------------
AveragePooling2D (averagepooli(None, 3, 16, 16)             0
Convolution2D (convolution2d) (None, 32, 16, 16)            896
Activation (activation)       (None, 32, 16, 16)            0
Convolution2D (convolution2d) (None, 32, 14, 14)            9248
Activation (activation)       (None, 32, 14, 14)            0
MaxPooling2D (maxpooling2d)   (None, 32, 7, 7)              0
Dropout (dropout)             (None, 32, 7, 7)              0
Convolution2D (convolution2d) (None, 64, 7, 7)              18496
Activation (activation)       (None, 64, 7, 7)              0
Convolution2D (convolution2d) (None, 64, 5, 5)              36928
Activation (activation)       (None, 64, 5, 5)              0
MaxPooling2D (maxpooling2d)   (None, 64, 2, 2)              0
Dropout (dropout)             (None, 64, 2, 2)              0
Flatten (flatten)             (None, 256)                   0
Dense (dense)                 (None, 512)                   131584
Activation (activation)       (None, 512)                   0
Dense (dense)                 (None, 10)                    5130
Dense (dense)                 (None, 512)                   5632
Activation (activation)       (None, 512)                   0
--------------------------------------------------------------------------------
Total params: 207914
--------------------------------------------------------------------------------

3. Model display

The following call would draw the model as a figure, but plot cannot be imported in keras 0.3.2:

from keras.utils.visualize_util import plot  # the plot helper in keras 1.x
plot(model, to_file=saveDir+'%s_model_loaded.png' % (str(quesNo)))

So I just use model.summary() to print the architecture.

03/02/2017 Thu

See "another find" (03/02/2017 Thu) above: I upgraded keras to the latest version, 1.2.2, fixed dim_ordering='th' in ~/.keras/keras.json, and used the new load_model instead of separating weights and structure.

model.save_weights('model_1.h5')
model1.load_weights('model_1.h5') # ok
model.save('model_2.h5')
from keras.models import load_model
model2 = load_model('model_1.h5') # error: 'model_1.h5' only contains weights, so load_model cannot find a whole saved model in it
model2 = load_model('model_2.h5') # ok

But, keras 1.2.2 has another problem:

UserWarning: The "show_accuracy" argument is deprecated, instead you should pass the "accuracy" metric to the model at compile time:
`model.compile(optimizer, loss, metrics=["accuracy"])`

Otherwise the result will only contain loss rather than acc. So I added metrics=["accuracy"] as the last argument of model.compile.
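
So the compile call becomes (optimizer and loss as used above):

sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=["accuracy"])  # history now reports acc as well as loss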

problems in model.add()

There are two ways to save a trained model, as posted before (02/28/2017 Tue and 03/02/2017 Thu):
One is to save/load the architecture of the model (JSON) and the weights (HDF5) separately.

from keras.models import model_from_json
json_string = model.to_json()          # architecture only, as a JSON string
model.save_weights('m.h5')             # weights only, as HDF5
model = model_from_json(json_string)   # rebuild the architecture
model.load_weights('m.h5')             # then restore the weights

The other is to save/load the whole model using model.save('m.h5') and load_model('m.h5').

After loading, we can pop to remove the last layer of the model.
But l0 = model.layers.pop() only pops the layer node, not the connection.
model1:

graph LR

id3(input)-.->L1
L1-->L2
L2-->L3
L3-->L4
L4-.->id4(L4.output)

id1(input)-.->id0((model1))
id0((model1))-.->id2(L4.output)

style id1 fill:#ccf,stroke:#f66,stroke-width:2px,stroke-dasharray: 5, 5;
style id0 fill:#f9f,stroke:#333,stroke-width:4px;
style id2 fill:#ccf,stroke:#f66,stroke-width:2px,stroke-dasharray: 5, 5;
style id3 fill:#ccf,stroke:#f66,stroke-width:2px,stroke-dasharray: 5, 5;
style id4 fill:#ccf,stroke:#f66,stroke-width:2px,stroke-dasharray: 5, 5;

model2 = Sequential(model1):

graph LR
id3(input)-.->L1
L1-->L2
L2-->L3
L3-->L4
L1-->L2
L2-->L3
L3-->L4
L4-.->id4(L4.output)
id1(input)-.->id0((model2))
id0((model2))-.->id2(L4.output)

style id1 fill:#ccf,stroke:#f66,stroke-width:2px,stroke-dasharray: 5, 5;
style id0 fill:#f9f,stroke:#333,stroke-width:4px;
style id2 fill:#ccf,stroke:#f66,stroke-width:2px,stroke-dasharray: 5, 5;
style id3 fill:#ccf,stroke:#f66,stroke-width:2px,stroke-dasharray: 5, 5;
style id4 fill:#ccf,stroke:#f66,stroke-width:2px,stroke-dasharray: 5, 5;

The layers list and the model output are not kept consistent by the single operation pop: model1.layers.pop() only pops the layer node, not the connection.

graph LR

id3(input)-.->L1
L1-->L2
L2-->L3
L3-.->id4(L4.output)

id1(input)-.->id0((model1))
id0((model1))-.->id2(L4.output)

style id1 fill:#ccf,stroke:#f66,stroke-width:2px,stroke-dasharray: 5, 5;
style id0 fill:#f9f,stroke:#333,stroke-width:4px;
style id2 fill:#ccf,stroke:#f66,stroke-width:2px,stroke-dasharray: 5, 5;
style id3 fill:#ccf,stroke:#f66,stroke-width:2px,stroke-dasharray: 5, 5;
style id4 fill:#ccf,stroke:#f66,stroke-width:2px,stroke-dasharray: 5, 5;

model1.add(l5)

graph LR

id3(input)-.->L1
L1-->L2
L2-->L3
L3-->L5
L3-.->id4(L4.output)

id1(input)-.->id0((model1))
id0((model1))-.->id2(L4.output)

style id1 fill:#ccf,stroke:#f66,stroke-width:2px,stroke-dasharray: 5, 5;
style id0 fill:#f9f,stroke:#333,stroke-width:4px;
style id2 fill:#ccf,stroke:#f66,stroke-width:2px,stroke-dasharray: 5, 5;
style id3 fill:#ccf,stroke:#f66,stroke-width:2px,stroke-dasharray: 5, 5;
style id4 fill:#ccf,stroke:#f66,stroke-width:2px,stroke-dasharray: 5, 5;

So we need to do it like this:

model1.layers.pop()                            # drop the layer node
model1.outputs = [model1.layers[-1].output]    # repoint the model output to the new last layer
model1.layers[-1].outbound_nodes = []          # cut the dangling outbound connection

graph LR

id3(input)-.->L1
L1-->L2
L2-->L3
L3-.->id4(L3.output)

id1(input)-.->id0((model1))
id0((model1))-.->id2(L3.output)

style id1 fill:#ccf,stroke:#f66,stroke-width:2px,stroke-dasharray: 5, 5;
style id0 fill:#f9f,stroke:#333,stroke-width:4px;
style id2 fill:#ccf,stroke:#f66,stroke-width:2px,stroke-dasharray: 5, 5;
style id3 fill:#ccf,stroke:#f66,stroke-width:2px,stroke-dasharray: 5, 5;
style id4 fill:#ccf,stroke:#f66,stroke-width:2px,stroke-dasharray: 5, 5;

And there is no need to keep the popped layer l0 anymore.
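
Wrapped as a small helper for reuse (a sketch for a keras 1.x Sequential model; pop_layer is my own name, not a keras API):

def pop_layer(model):
    # remove the last layer and keep the model's output bookkeeping consistent
    if not model.layers:
        raise ValueError('model has no layers to pop')
    model.layers.pop()                           # drop the layer node
    model.outputs = [model.layers[-1].output]    # repoint the model output
    model.layers[-1].outbound_nodes = []         # cut the dangling connection
    return model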

name set in model.add()

In keras 1.2.2, if you load a model and add a layer, you need to give the new layer a fresh name to avoid a duplicate: the loaded model may already have a layer named dense_1, and .add(Dense(512)) will auto-name the new layer dense_1 again. The right way is .add(Dense(512, name="dense_new")).
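
In context (assuming load_model and Dense are imported as above; the names are illustrative):

model = load_model('model_2.h5')          # already contains a layer auto-named dense_1
model.add(Dense(512, name="dense_new"))   # an explicit name avoids a second dense_1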

Adam params in keras 1.2.2

In keras 0.3.3, I used adam = Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.01) as the optimizer in compile and everything looked good.
However, in keras 1.2.2 this makes the loss nan and gives bad results, since the learning rate decays away too soon.
So after I fixed decay to 1e-6, it turned out right: adam = Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=1e-6)
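
To see why decay=0.01 collapses so fast: as far as I can tell, keras applies lr_t = lr / (1 + decay * iterations) per update, so a quick sketch (plain Python, my own illustration):

lr, decay = 0.001, 0.01
for it in (0, 100, 1000, 10000):
    # effective learning rate under keras-style time decay
    print(it, lr / (1. + decay * it))
# prints 0.001, 0.0005, ~9.1e-05, ~9.9e-06: the learning rate collapses within a few epochs,
# while decay=1e-6 shrinks it by only ~1% over the same 10000 iterations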

Local norm before each activation

Keras now supports the bias=False option, so we can save some computation (BatchNormalization's learned shift makes a preceding bias redundant) by writing:

model.add(Dense(64, bias=False))              # no bias needed right before BN
model.add(BatchNormalization(axis=bn_axis))   # normalize, then learn scale/shift
model.add(Activation('tanh'))
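
The same pattern applies to the conv blocks; a sketch assuming keras 1.2.2 with dim_ordering='th', so the channel axis (bn_axis) is 1:

from keras.layers import Convolution2D, Activation
from keras.layers.normalization import BatchNormalization

bn_axis = 1  # channel axis under dim_ordering='th'
model.add(Convolution2D(32, 3, 3, border_mode='same', bias=False))  # bias folded into BN's beta
model.add(BatchNormalization(axis=bn_axis))
model.add(Activation('relu'))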

|quesNo|layer|optimizer|other|err|val_err|
|:-:|:-:|:-:|:-:|:-:|:-:|
|3|as #1|Adam(decay=1e-6)|7s|||
|4|as #3|Adam(decay=0)|7s|||
|4.1|4 + add FC512+ReLU, before FC 10|Adam|7s|||
|4.2|4 + add dropout, after both FC 512|Adam|7s|0.307|0.298|
|4.3|4.2 + ReLU->sigmoid in both FC 512|Adam|7s|-|-|
|4.4|4.3 + 32->64 in 1st conv + 512->128 in 1st FC|Adam|7s|0.294|0.285|
|4.5|4.4 + data augmentation|Adam|7s|||
|4.6|4.2 + 512->256 in 2nd FC, before FC 10|Adam|7s|0.294|0.281|
|4.7|4.3 + 512->256 in 2nd FC, before FC 10|Adam|7s|0.267|0.266|
|4.8'|4.7 + add conv(128) as a 3rd block|Adam|error|after 4 poolings, image 32->2, cannot conv|-|
|4.9|4.7 + local norm before each activation|Adam|52s|0.213|0.243|
|4.11|4.7 + remove AveragePooling2D, add conv(128) as a 3rd block|Adam|26s|rise|rise|
|4.12|4.9 + remove AveragePooling2D, add conv(128) as a 3rd block|Adam|170s|0.073|0.152|
|4.13|4.11 + data augmentation|Adam|170s|0.107|0.144|
|4.14|conv(48->96->192) + FC(512->256->10) + data aug|Adam|28s|0.225|0.216|
|4.15|as #4.14|default Adam|28s|0.225|0.216|

The difference between 4.11 and 4.14 is the number of feature filters in the conv layers.

Question 4: load from 3_model and use data augmentation to prevent overfitting to the training data.

Adam with decay=0.01 makes the accuracy drop to a constant; it should be 1e-6, or simply use the default Adam.
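
So the setting becomes (to the best of my knowledge, keras 1.2.2 Adam defaults are lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0):

adam = Adam()   # default Adam, decay=0.0
model.compile(loss='categorical_crossentropy', optimizer=adam, metrics=["accuracy"])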

03/06/2017 Mon

Completed the runs; started writing the report.

#1

graph TB
    subgraph Fully-connected block
        e1(Flatten) --> e2("Dense(512,ReLu)")
        e2 --> e3(Dropout)
        e3 --> e6("Dense(10,Softmax)")
    end
    subgraph Conv block 2: 64
        c1("Conv2D(64,ReLu)") --> c2("Conv2D(64,ReLu)")
        c2 --> c3(MaxPooling2D)
        c3 --> c4(Dropout)
    end

    subgraph Conv block 1: 32
        b1("Conv2D(32,ReLu)") --> b2("Conv2D(32,ReLu)")
        b2 --> b3(MaxPooling2D)
        b3 --> b4(Dropout)
    end
    subgraph Pooling
        a(AveragePooling2D)
    end
    %%a --> b1
    %%b4 -->c1
    %%c4 --> e1
    %% classDef green fill:#9f6,stroke:#333,stroke-width:2px;;
    %% classDef orange fill:#f48c42,stroke:#f66,stroke-width:2px,stroke-dasharray: 5, 5;
    classDef del fill:#a4a8a7,stroke:#f66,stroke-width:3px,stroke-dasharray: 5, 5;
    class A,b1,b2,b3 green
    class e3 del

#2

graph TB
    subgraph Fully-connected block
        e1(Flatten) --> e2("Dense(512,ReLu)")
        e2 --> e4("Dense(512,ReLu)")
        e4 --> e6("Dense(10,Softmax)")
    end
    subgraph Conv block 2: 64
        c1("Conv2D(64,ReLu)") --> c2("Conv2D(64,ReLu)")
        c2 --> c3(MaxPooling2D)
        c3 --> c4(Dropout)
    end

    subgraph Conv block 1: 32
        b1("Conv2D(32,ReLu)") --> b2("Conv2D(32,ReLu)")
        b2 --> b3(MaxPooling2D)
        b3 --> b4(Dropout)
    end
    subgraph Pooling
        a(AveragePooling2D)
    end
    %%a --> b1
    %%b4 -->c1
    %%c4 --> e1
    classDef green fill:#9f6,stroke:#333,stroke-width:2px;;
    %% classDef orange fill:#f48c42,stroke:#f66,stroke-width:2px,stroke-dasharray: 5, 5;
    classDef del fill:#a4a8a7,stroke:#f66,stroke-width:3px,stroke-dasharray: 5, 5;
    class e4 green

#4.2

graph TB
    subgraph Fully-connected block
        e1(Flatten) --> e2("Dense(512,ReLu)")
        e2 --> e3(Dropout)
        e3 --> e4("Dense(512,ReLu)")
        e4 --> e5(Dropout)
        e5 --> e6("Dense(10,Softmax)")
    end
    subgraph Conv block 2: 64
        c1("Conv2D(64,ReLu)") --> c2("Conv2D(64,ReLu)")
        c2 --> c3(MaxPooling2D)
        c3 --> c4(Dropout)
    end

    subgraph Conv block 1: 32
        b1("Conv2D(32,ReLu)") --> b2("Conv2D(32,ReLu)")
        b2 --> b3(MaxPooling2D)
        b3 --> b4(Dropout)
    end
    subgraph Pooling
        a(AveragePooling2D)
    end
    %%a --> b1
    %%b4 -->c1
    %%c4 --> e1
    classDef green fill:#9f6,stroke:#333,stroke-width:2px;;
    %% classDef orange fill:#f48c42,stroke:#f66,stroke-width:2px,stroke-dasharray: 5, 5;
    classDef del fill:#a4a8a7,stroke:#f66,stroke-width:3px,stroke-dasharray: 5, 5;
    class e3,e5 green

#4.3

graph TB
    subgraph Fully-connected block
        e1(Flatten) --> e2("Dense(512,Sigmoid)")
        e2 --> e3(Dropout)
        e3 --> e4("Dense(512,Sigmoid)")
        e4 --> e5(Dropout)
        e5 --> e6("Dense(10,Softmax)")
    end
    subgraph Conv block 2: 64
        c1("Conv2D(64,ReLu)") --> c2("Conv2D(64,ReLu)")
        c2 --> c3(MaxPooling2D)
        c3 --> c4(Dropout)
    end

    subgraph Conv block 1: 32
        b1("Conv2D(32,ReLu)") --> b2("Conv2D(32,ReLu)")
        b2 --> b3(MaxPooling2D)
        b3 --> b4(Dropout)
    end
    subgraph Pooling
        a(AveragePooling2D)
    end
    %%a --> b1
    %%b4 -->c1
    %%c4 --> e1
    classDef green fill:#9f6,stroke:#333,stroke-width:2px;;
    %% classDef orange fill:#f48c42,stroke:#f66,stroke-width:2px,stroke-dasharray: 5, 5;
    classDef del fill:#a4a8a7,stroke:#f66,stroke-width:3px,stroke-dasharray: 5, 5;
    class e2,e3,e4,e5 green

#4.7

graph TB
    subgraph Fully-connected block
        e1(Flatten) --> e2("Dense(512,Sigmoid)")
        e2 --> e3(Dropout)
        e3 --> e4("Dense(256,Sigmoid)")
        e4 --> e5(Dropout)
        e5 --> e6("Dense(10,Softmax)")
    end
    subgraph Conv block 2: 64
        c1("Conv2D(64,ReLu)") --> c2("Conv2D(64,ReLu)")
        c2 --> c3(MaxPooling2D)
        c3 --> c4(Dropout)
    end

    subgraph Conv block 1: 32
        b1("Conv2D(32,ReLu)") --> b2("Conv2D(32,ReLu)")
        b2 --> b3(MaxPooling2D)
        b3 --> b4(Dropout)
    end
    subgraph Pooling
        a(AveragePooling2D)
    end
    %%a --> b1
    %%b4 -->c1
    %%c4 --> e1
    classDef green fill:#9f6,stroke:#333,stroke-width:2px;;
    %% classDef orange fill:#f48c42,stroke:#f66,stroke-width:2px,stroke-dasharray: 5, 5;
    classDef del fill:#a4a8a7,stroke:#f66,stroke-width:3px,stroke-dasharray: 5, 5;
    class e4 green

#4.9

graph TB
    subgraph Fully-connected block
        e1(Flatten) --> e2("Dense(512)<br>Local Norm<br>Act:Sigmoid")
        e2 --> e3(Dropout)
        e3 --> e4("Dense(256)<br>Local Norm<br>Act:Sigmoid")
        e4 --> e5(Dropout)
        e5 --> e6("Dense(10,Softmax)")
    end
    subgraph Conv block 2: 64
        c1("Conv2D(64)"<br>Local Norm<br>Act:ReLu) --> c2("Conv2D(64)"<br>Local Norm<br>Act:ReLu)
        c2 --> c3(MaxPooling2D)
        c3 --> c4(Dropout)
    end

    subgraph Conv block 1: 32
        b1("Conv2D(32)"<br>Local Norm<br>Act:ReLu) --> b2("Conv2D(32)"<br>Local Norm<br>Act:ReLu)
        b2 --> b3(MaxPooling2D)
        b3 --> b4(Dropout)
    end
    subgraph Pooling
        a(AveragePooling2D)
    end
    %%a --> b1
    %%b4 -->c1
    %%c4 --> e1
    classDef green fill:#9f6,stroke:#333,stroke-width:2px;;
    %% classDef orange fill:#f48c42,stroke:#f66,stroke-width:2px,stroke-dasharray: 5, 5;
    classDef del fill:#a4a8a7,stroke:#f66,stroke-width:3px,stroke-dasharray: 5, 5;
    class e2,e4,c1,c2,b1,b2 green

#4.12

graph TB
    subgraph Fully-connected block
        e1(Flatten) --> e2("Dense(512)<br>Local Norm<br>Act:Sigmoid")
        e2 --> e3(Dropout)
        e3 --> e4("Dense(256)<br>Local Norm<br>Act:Sigmoid")
        e4 --> e5(Dropout)
        e5 --> e6("Dense(10,Softmax)")
    end
    subgraph Conv block 3: 128
        d1("Conv2D(128)"<br>Local Norm<br>Act:ReLu) --> d2("Conv2D(128)"<br>Local Norm<br>Act:ReLu)
        d2 --> d3(MaxPooling2D)
        d3 --> d4(Dropout)
    end
    subgraph Conv block 2: 64
        c1("Conv2D(64)"<br>Local Norm<br>Act:ReLu) --> c2("Conv2D(64)"<br>Local Norm<br>Act:ReLu)
        c2 --> c3(MaxPooling2D)
        c3 --> c4(Dropout)
    end

    subgraph Conv block 1: 32
        b1("Conv2D(32)"<br>Local Norm<br>Act:ReLu) --> b2("Conv2D(32)"<br>Local Norm<br>Act:ReLu)
        b2 --> b3(MaxPooling2D)
        b3 --> b4(Dropout)
    end
    %%a --> b1
    %%b4 -->c1
    %%c4 --> e1
    classDef green fill:#9f6,stroke:#333,stroke-width:2px;;
    %% classDef orange fill:#f48c42,stroke:#f66,stroke-width:2px,stroke-dasharray: 5, 5;
    classDef del fill:#a4a8a7,stroke:#f66,stroke-width:3px,stroke-dasharray: 5, 5;
    class d1,d2,d3,d4 green

Back to subcontents CS 519
Back to CONTENTS


CS 519 Project

Back to subcontents CS 519
Back to CONTENTS