
Saturday, March 19, 2011

Principal Component Analysis: A Handy Data Mining Technique

Principal component analysis (PCA) is a mathematical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into values of uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables. The transformation is defined so that the first principal component has the largest possible variance (that is, it accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest possible variance under the constraint that it is orthogonal to (uncorrelated with) the preceding components. The principal components are guaranteed to be independent only if the data set is jointly normally distributed. PCA is sensitive to the relative scaling of the original variables. Depending on the field of application, it is also known as the discrete Karhunen-Loève transform (KLT), the Hotelling transform, or proper orthogonal decomposition (POD).

PCA was invented in 1901 by Karl Pearson. It is now mostly used as a tool in exploratory data analysis and for building predictive models. PCA can be performed by eigenvalue decomposition of the data covariance matrix or by singular value decomposition of the data matrix, usually after mean-centering the data for each attribute. The results of a PCA are usually discussed in terms of component scores (the transformed variable values corresponding to a particular case in the data) and loadings (the weight by which each standardized original variable must be multiplied to obtain the component score).

PCA is the simplest of the true eigenvector-based multivariate analyses. Its operation can be thought of as revealing the internal structure of the data in the way that best explains the variance in the data. If a multivariate data set is viewed as a set of coordinates in a high-dimensional data space (one axis per variable), PCA can supply the user with a lower-dimensional picture, a "shadow" of the object seen from its (in some sense) most informative viewpoint. This is done by using only the first few principal components, so that the dimensionality of the transformed data is reduced.
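As a minimal sketch of the eigendecomposition route described above, the plain-Perl snippet below runs PCA on a made-up two-variable data set: it mean-centers the data, builds the 2x2 covariance matrix, and extracts its eigenvalues and first principal component in closed form. The data values and variable names are illustrative only, not taken from any real study.

#!/usr/bin/perl
use strict;
use warnings;

# Toy 2-D data set: each row is an observation (x, y). Values are made up.
my @data = ( [2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
             [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9] );
my $n = scalar @data;

# Mean-center each variable.
my ($mx, $my) = (0, 0);
for my $p (@data) { $mx += $p->[0]; $my += $p->[1]; }
$mx /= $n;
$my /= $n;

# Sample covariance matrix [[sxx, sxy], [sxy, syy]].
my ($sxx, $syy, $sxy) = (0, 0, 0);
for my $p (@data) {
    my ($dx, $dy) = ($p->[0] - $mx, $p->[1] - $my);
    $sxx += $dx * $dx;
    $syy += $dy * $dy;
    $sxy += $dx * $dy;
}
$sxx /= $n - 1;
$syy /= $n - 1;
$sxy /= $n - 1;

# Eigenvalues of a 2x2 symmetric matrix have a closed form via trace and determinant.
my $trace = $sxx + $syy;
my $det   = $sxx * $syy - $sxy * $sxy;
my $disc  = sqrt($trace * $trace / 4 - $det);
my ($l1, $l2) = ($trace / 2 + $disc, $trace / 2 - $disc);

# First principal component: unit eigenvector belonging to the larger eigenvalue.
my @pc1  = ($sxy, $l1 - $sxx);
my $norm = sqrt($pc1[0]**2 + $pc1[1]**2);
@pc1 = map { $_ / $norm } @pc1;

printf "Eigenvalues: %.4f  %.4f\n", $l1, $l2;
printf "First principal component: (%.4f, %.4f)\n", @pc1;
printf "Variance explained by PC1: %.1f%%\n", 100 * $l1 / ($l1 + $l2);

Projecting each centered observation onto that first eigenvector gives the component scores; keeping only that one coordinate is exactly the dimension reduction described above.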

PCA is closely related to factor analysis, and some statistical packages (such as Stata) deliberately conflate the two techniques. True factor analysis makes different assumptions about the underlying structure and solves the eigenvectors of a slightly different matrix.







Wednesday, March 9, 2011

Hashing -- Hash Function

A hash function is a well-defined procedure or mathematical function that converts a large, possibly variable-sized amount of data into a small datum, usually a single integer, that may serve as an index into an array (cf. associative array). The values returned by a hash function are called hash values, hash codes, hash sums, or simply hashes.

Hash functions are mostly used to speed up table lookup or data comparison tasks, such as finding items in a database, detecting duplicate or similar records in a large file, finding similar stretches in DNA sequences, and so on.

A hash function may map two or more keys to the same hash value. In many applications it is desirable to minimize the occurrence of such collisions, which means that the hash function must map the keys to the hash values as evenly as possible. Depending on the application, other properties may be required as well. Although the idea was conceived in the 1950s, the design of good hash functions is still a topic of active research.
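As a small illustration, here is a sketch of a classic djb2-style string hash in Perl, mapping a few example keys into a fixed number of buckets. The keys and the bucket count are arbitrary choices for the demonstration.

#!/usr/bin/perl
use strict;
use warnings;

# djb2-style string hash: h = h*33 + ord(char), kept within 32 bits.
sub djb2_hash {
    my ($str) = @_;
    my $hash = 5381;
    for my $ch (split //, $str) {
        $hash = (($hash * 33) + ord($ch)) & 0xFFFFFFFF;
    }
    return $hash;
}

my $buckets = 8;                          # arbitrary table size
for my $key (qw(apple banana cherry date apricot)) {
    my $h = djb2_hash($key);
    printf "%-8s -> hash %10u -> bucket %d\n", $key, $h, $h % $buckets;
}

Two keys that land in the same bucket (or, more rarely, produce the same 32-bit hash) are exactly the collisions discussed above.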

Hash functions are related to (and often confused with) checksums, check digits, fingerprints, randomization functions, error-correcting codes and cryptographic hash functions. Although these concepts overlap to some extent, each has its own uses and requirements and is designed and optimized differently. The HashKeeper database maintained by the American National Drug Intelligence Center, for instance, is more aptly described as a catalog of file fingerprints than of hash values.

Hash functions are, by definition and implementation, pseudo-random number generators (PRNGs). From this generalization it follows that the performance of hash functions, and comparisons between different hash functions, can be assessed by treating each hash function as a PRNG.

Techniques such as Poisson distribution analysis can be used to analyze the collision rates of different hash functions for different sets of data. In general there is a theoretical hash function, known as the perfect hash function, for any specific set of data. A perfect hash function by definition produces no collisions, meaning that no repeating hash values arise from different elements of the set. In reality it is very difficult to find a perfect hash function, and the practical applications of perfect hashing and its variant, minimal perfect hashing, are quite limited. In practice it is generally recognized that a "perfect" hash function is the one that produces the least number of collisions for a particular set of data.

The problem is that there are so many permutations of types of data, some highly random, others containing a high degree of patterning, that it is hard to generalize a hash function for all data types, or even for specific data types. All one can do is find, by trial and error, the hash function that best suits one's needs. Some of the dimensions along which hash functions can be analyzed are:

* Data Distribution

This measures how well the hash function distributes the hash values of the elements of a data set. Analyzing this measure requires knowing the number of collisions that occur with the data set, that is, the number of non-unique hash values. If chaining is used to resolve collisions, the average chain length (which in effect is the average collision count per bucket) should be analyzed, together with the amount of clustering of the hash values within the hash range. A small collision-counting sketch appears after this list.
* Hash Function Efficiency

This measures how efficiently the hash function produces hash values for the elements of a data set. When algorithms that contain hash functions are analyzed, it is usually assumed that hash functions have a complexity of O(1); that is why look-ups for data in a hash table are said to be of O(1) complexity "on average", whereas look-ups for data in ordered associative containers such as maps (usually implemented as red-black trees) are said to be of O(log n) complexity.

In theory a hash function should be a very quick, stable and deterministic operation. A hash function may not always lend itself to O(1) complexity; however, a linear traversal through the string of data being hashed is generally fast, and since hash functions are typically applied to keys, which by definition are much smaller associative identifiers for larger blocks of data, the whole operation should be quick and reasonably stable.
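Here is a minimal sketch of the distribution measure mentioned in the list above: it hashes a handful of made-up keys into a deliberately small table and reports how many keys land in each bucket. The djb2-style hash is the same illustrative routine as in the earlier example.

#!/usr/bin/perl
use strict;
use warnings;

# Same djb2-style hash as in the earlier sketch.
sub djb2_hash {
    my ($str) = @_;
    my $hash = 5381;
    $hash = (($hash * 33) + ord($_)) & 0xFFFFFFFF for split //, $str;
    return $hash;
}

my $buckets = 4;                                  # deliberately small table
my @keys    = qw(red orange yellow green blue indigo violet cyan magenta);
my %count;                                        # keys per bucket

$count{ djb2_hash($_) % $buckets }++ for @keys;

for my $b (0 .. $buckets - 1) {
    my $n = $count{$b} // 0;
    printf "bucket %d: %d key(s)%s\n", $b, $n, $n > 1 ? "  <- collisions" : "";
}

A well-distributing hash function would spread the keys roughly evenly across the buckets; long chains in a few buckets indicate clustering.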

The hash functions in this article are known as simple hash functions. They are typically used for data hashing (string hashing), that is, to create keys for associative containers such as hash tables. These hash functions are not cryptographically secure: they can easily be reversed, and many different combinations of data can easily be found that produce the same hash value.

Here is a simple example of hashing used for password matching.
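This is a minimal sketch using Perl's core Digest::SHA module: only the SHA-256 digest of the password is stored, and at login the digest of the attempt is compared with the stored one. The user name and password are made up; a real system would also add a per-user salt and use a deliberately slow hash such as bcrypt, which is omitted here for brevity.

#!/usr/bin/perl
use strict;
use warnings;
use Digest::SHA qw(sha256_hex);

# "Registration": store only the hash of the password, never the password itself.
my %password_db;                                     # user => hex digest
$password_db{alice} = sha256_hex('s3cret-phrase');   # example credentials

# "Login": hash the attempt and compare it with the stored digest.
sub check_password {
    my ($user, $attempt) = @_;
    return 0 unless exists $password_db{$user};
    return sha256_hex($attempt) eq $password_db{$user};
}

print check_password('alice', 's3cret-phrase') ? "login ok\n" : "rejected\n";
print check_password('alice', 'wrong-guess')   ? "login ok\n" : "rejected\n";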

Different types of Hashing

Hashing, as a tool for associating one set or bulk of data with an identifier, has many different applications in the real world. Below are some common uses of hash functions.

* String Hashing

Used in the area of data storage access, mainly within the indexing of data and as a structural back end to associative containers (i.e. hash tables).
* Cryptographic Hashing

Used for data/user verification and authentication. A strong cryptographic hash function has the property that it is very difficult to reverse the hash result and thus reproduce the original piece of data. Cryptographic hash functions are used to hash users' passwords, so that only the hash of the password is stored on the system rather than the password itself. Cryptographic hash functions can also be seen as irreversible compression functions: they can represent a large quantity of data with a single identifier, they make it possible to check whether data has been tampered with, and they can serve as the value one signs to prove the authenticity of a document via other cryptographic means.
* Geometric Hashing

This form of hashing is used in the field of computer vision for the recognition of classified objects in arbitrary scenes.

The initial process involves selecting a region or object of interest. From there, using affine-invariant feature detection algorithms such as the Harris corner detector (HCD), the Scale-Invariant Feature Transform (SIFT) or Speeded-Up Robust Features (SURF), a set of affine features is extracted which is deemed to represent the object or region. This set is sometimes called a macro-feature or a constellation of features. Depending on the nature of the features detected and the type of object or region being classified, it may still be possible to match two constellations of features even though there are minor differences (such as missing or outlier features) between the two sets. The constellation is then said to be the classified set of features.

A hash value is then computed from the constellation of features. This is typically done by first defining the space in which the hash values are intended to reside; the hash value in this case is a multidimensional value normalized for that space. Coupled with the process for computing the hash value, another process is needed that determines the distance between two hash values: a distance measure is required, rather than a deterministic equality operator, because of the possible disparities between the constellations that went into calculating the hash values. Also, owing to the non-linear nature of such spaces, a simple Euclidean distance metric is essentially ineffective; as a result, the process of automatically determining a distance metric for a particular space has become an active field of research in academia.

Examples of geometric hashing include the classification of various kinds of automobiles for the purpose of re-detection in arbitrary scenes. The level of detection can be varied from merely detecting a vehicle, to detecting a particular model of vehicle, to identifying a specific vehicle.
* Bloom Filters

A Bloom filter allows the "state of existence" of a large set of possible values to be represented with a much smaller piece of memory than the total size of the values. In computer science this is known as a membership query and is a core concept in associative containers.

A Bloom filter achieves this through the use of multiple distinct hash functions and by allowing the result of a membership query for a particular value to carry a certain probability of error. The Bloom filter guarantees that no membership query will ever return a false negative; however, false positives are possible. The false-positive probability can be controlled by varying the size of the table used by the Bloom filter and by varying the number of hash functions.

Subsequent research in the area of hash functions, hash tables and Bloom filters by Mitzenmacher et al. suggests that for most practical uses of such constructs, the entropy of the data being hashed contributes to the entropy of the hash functions. This leads to theoretical results showing that an optimal Bloom filter (one that provides the lowest false-positive probability for a given table size, or vice versa) with a user-defined false-positive probability can be constructed with at most two distinct pairwise-independent hash functions, greatly increasing the efficiency of membership queries.
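Below is a minimal Bloom filter sketch in Perl, using a plain bit array and two simple, non-cryptographic string hashes purely for illustration; the table size, the hash mixing constants and the sample words are all arbitrary.

#!/usr/bin/perl
use strict;
use warnings;

my $m    = 1024;          # number of bits in the filter (arbitrary)
my @bits = (0) x $m;      # the bit array itself

# Two simple, independent string hashes (illustrative only, not cryptographic).
sub h1 { my $h = 5381; $h = ($h * 33  + ord($_)) % $m for split //, $_[0]; return $h; }
sub h2 { my $h = 0;    $h = ($h * 131 + ord($_)) % $m for split //, $_[0]; return $h; }

# Adding a value sets the bits chosen by both hash functions.
sub add      { my ($x) = @_; $bits[h1($x)] = 1; $bits[h2($x)] = 1; }

# Membership query: "possibly present" (may be a false positive) or "definitely absent".
sub maybe_in { my ($x) = @_; return $bits[h1($x)] && $bits[h2($x)]; }

add($_) for qw(apple banana cherry);

for my $word (qw(apple grape banana kiwi)) {
    print "$word: ", maybe_in($word) ? "possibly present\n" : "definitely absent\n";
}

Increasing the bit-array size or choosing better hash functions lowers the false-positive rate, exactly as described above.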

Bloom filters are commonly found in applications such as spell-checkers, string-matching algorithms, network packet analysis tools and web/Internet caches.

Saturday, February 26, 2011

Cloud Computing!!! It Rocks

Cloud computing describes computation, software, data access, and storage services that do not require end-user knowledge of the physical location or configuration of the system that delivers the services. A parallel to this concept can be drawn with the electricity grid, where end users consume power without needing to understand the component devices or infrastructure required to deliver the service.



Cloud computing is a natural evolution of the widespread adoption of virtualization, service-oriented architecture, and autonomic and utility computing. Details are abstracted from end users, who no longer need expertise in, or control over, the technology infrastructure "in the cloud" that supports them.

Cloud computing describes a new supplement, consumption, and delivery model for IT services based on Internet protocols, and it typically involves the provision of dynamically scalable and often virtualized resources. It commonly takes the form of web-based tools or applications that users access over the Internet through a web browser as if they were programs installed locally on their own computers.




What do world-class CIOs do today to operate more efficiently and cost-effectively than their peers? The answer is cloud computing. Experienced CIO Fred Mapp today introduced a new cloud computing white paper, "The chances of confusing the sun." The white paper, published by OneNeck IT Services, describes the benefits of cloud computing and explains the different shapes and forms cloud computing can take.

CIOs, IT managers, and their staff are dealing with the challenge of keeping pace with technology change while facing pressure from the chief financial officer to reduce the IT budget and growing demands on the IT function from the business units. These challenges are driving companies to consider the cloud as a way to achieve flexibility and scalability in managing their IT applications and infrastructure. However, many companies are still reluctant to make the leap to the cloud.


Fred Mapp shares, "Corporate management, and in particular CIOs, are still hesitant and slow to migrate to the cloud. Legitimate concerns exist with any new technology or platform trend, and CIOs are not yet completely sold on the value of the cloud and how it fits into the company's information technology plan."



Mapp continued, "IT transformation can be a difficult change, but the fact is that today's mobile, Internet-driven engineering and IT environments are providing a new way of delivering business applications. Cloud computing is moving to the forefront of the market; it is not just a buzzword, but will provide huge benefits to the companies that embrace it."

The complimentary white paper on cloud computing services from OneNeck IT is available at
http://www.oneneck.com/cloud-computing-white-paper.cfm

Try cloud computing and receive a $25 credit.




Monday, February 21, 2011

Use "Cheat Engine" to increase the download speed for Mu torrent!!!

If you are using BitTorrent, Mu torrent or uTorrent and you want to optimize your download speed, follow the simple steps below.

It works!!!!

You know... first you have to download Cheat Engine!
Follow the link to do so.....
http://www.cheatengine.org/downloads.php
The current version is 6, but if you have the older version 5 or above.... no problem, it works.

1. Install the software by following the simple steps during installation.
2. Install the Bit/Mu/U torrent downloader.
3. Get your favorite torrent from the web.
4. Start downloading and deactivate your anti-virus shield.
5. Now click on the Cheat Engine icon on the desktop to start Cheat Engine; you will see the following interface.

6. Select the process corresponding to your torrent application using the following three steps:
       1. Click on the leftmost process-search icon.
       2. Select the process... you may not be able to recognize the number/code, but the icon is visible, so don't worry about the code or number.
       3. Click "Done".
7. Now that you have selected a process, it's time to increase the download speed. Check the check-box "Enable Speedhack" on the right-hand side of the Cheat Engine application.

8. The default value appears as 1.0..... now replace 1.0 with 0.1.
9. An Apply option will appear; click Apply.....
Bingo!!!!! You have done it!!!!!
You will see the download speed increase, up to your allotted bandwidth.....
Enjoy Downloading!!!!

Sunday, February 20, 2011

The FASTA format for representing biological sequences, and a small Perl script to create the format.....

The FASTA format can be used to represent either a single sequence or many sequences in a single file. A series of single-sequence entries, concatenated together, make up a multi-sequence file. The best source for a description of the FASTA/Pearson format is the documentation of the FASTA suite of programs. It can be downloaded with any free distribution of FASTA (see fasta20.doc, fastaVN.doc or fastaVN.me, where VN is the version number).

A sequence in FASTA format is represented as a series of lines which should be no longer than 120 characters and usually do not exceed 80 characters. This was probably to allow the preallocation of fixed line sizes in software: at the time, most users relied on DEC VT (or compatible) terminals that could display 80 or 132 characters per line. Most people preferred the bigger font in 80-character mode, so it became the recommended practice to use 80 characters or fewer (often 70) per line of FASTA.

The first line in a FASTA file begins with either the ">" (greater-than) symbol or the ";" (semicolon) and was originally taken as a comment. Subsequent lines starting with a semicolon are ignored by software. Since only the first comment line was used, it quickly came to hold a summary description of the sequence, often starting with a unique library accession number, and over time it became the convention to always use ">" for the first line and to avoid ";" comments (which are otherwise ignored).

Following the initial line (used to uniquely describe the sequence) is the actual sequence itself, in the standard one-letter code. Anything other than a valid code is ignored (including spaces, tabs, asterisks, etc.). Originally it was also common to end the sequence with an "*" (asterisk) character (by analogy with PIR-formatted sequences) and, for the same reason, to leave a blank line between the description and the sequence.

For example...... this is the protein sequence of the SOS protein involved in a signalling pathway in the fruit fly:



>gi|18110536|ref|NP_476597.2| Son of sevenless [Drosophila melanogaster]
MFSGPSGHAHTISYGGGIGLGTGGGGGSGGSGSGSQGGGGGIGIGGGGVAGLQDCDGYDFTKCENAARWR
GLFTPSLKKVLEQVHPRVTAKEDALLYVEKLCLRLLAMLCAKPLPHSVQDVEEKVNKSFPAPIDQWALNE
AKEVINSKKRKSVLPTEKVHTLLQKDVLQYKIDSSVSAFLVAVLEYISADILKMAGDYVIKIAHCEITKE
DIEVVMNADRVLMDMLNQSEAHILPSPLSLPAQRASATYEETVKELIHDEKQYQRDLHMIIRVFREELVK
IVSDPRELEPIFSNIMDIYEVTVTLLGSLEDVIEMSQEQSAPCVGSCFEELAEAEEFDVYKKYAYDVTSQ
ASRDALNNLLSKPGASSLTTAGHGFRDAVKYYLPKLLLVPICHAFVYFDYIKHLKDLSSSQDDIESFEQV
QGLLHPLHCDLEKVMASLSKERQVPVSGRVRRQLAIERTRELQMKVEHWEDKDVGQNCNEFIREDSLSKL
GSGKRIWSERKVFLFDGLMVLCKANTKKQTPSAGATAYDYRLKEKYFMRRVDINDRPDSDDLKNSFELAP
RMQPPIVLTAKNAQHKHDWMADLLMVITKSMLDRHLDSILQDIERKHPLRMPSPEIYKFAVPDSGDNIVL
EERESAGVPMIKGATLCKLIERLTYHIYADPTFVRTFLTTYRYFCSPQQLLQLLVERFNIPDPSLVYQDT
GTAGAGGMGGVGGDKEHKNSHREDWKRYRKEYVQPVQFRVLNVLRHWVDHHFYDFEKDPMLLEKLLNFLE
HVNGKSMRKWVDSVLKIVQRKNEQEKSNKKIVYAYGHDPPPIEHHLSVPNDEITLLTLHPLELARQLTLL
EFEMYKNVKPSELVGSPWTKKDKEVKSPNLLKIMKHTTNVTRWIEKSITEAENYEERLAIMQRAIEVMMV
MLELNNFNGILSIVAAMGTASVYRLRWTFQGLPERYRKFLEECRELSDDHLKKYQERLRSINPPCVPFFG
RYLTNILHLEEGNPDLLANTELINFSKRRKVAEIIGEIQQYQNQPYCLNEESTIRQFFEQLDPFNGLSDK
QMSDYLYNESLRIEPRGCKTVPKFPRKWPHIPLKSPGIKPRRQNQTNSSSKLSNSTSSVAAAAAASSTAT
SIATASAPSLHASSIMDAPTAAAANAGSGTLAGEQSPQHNPHAFSVFAPVIIPERNTSS

Let's write a Perl script that takes a string from a file as input (the string represents a nucleotide and/or protein sequence) and generates FASTA format.

#!/usr/bin/perl
use strict;
use warnings;

my $file = $ARGV[0];                                        # ..... input file name from the command line
open(my $fh, '<', $file) or die "Cannot open $file: $!";    # ..... open the file with a file handle
my @file_array = <$fh>;                                     # ..... read all the lines of the file into one array
close($fh);                                                 # ..... close the file

# Remove line endings and collect the pieces of the sequence.
my @temp;
foreach my $line (@file_array)
{
     chomp $line;
     push @temp, $line;
}
my $sequence = join('', @temp);

print ">Sequence1\n";                                       # ..... FASTA description line

# Print the sequence in lines of 60 characters.
my $counter = 0;
my @seqarray = split(//, $sequence);
for (my $i = 0; $i < @seqarray; $i++)
{
       print $seqarray[$i];
       $counter++;
       if ($counter == 60)
      {
              print "\n";
              $counter = 0;
       }
}
print "\n" if $counter > 0;                                 # ..... finish the last (short) line
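To run it, pass the input file name as the first argument and redirect the output to a new file, for example (the file names here are only placeholders): perl fasta_format.pl my_sequence.txt > my_sequence.fasta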

Indexing and hashing in Databases 2

Index architectures can be classified as clustered or non-clustered.
Non-clustered
The data is stored in arbitrary order, but the logical ordering is specified by the index. The data rows may be spread throughout the table. The non-clustered index tree contains the index keys in sorted order, with the leaf level of the index containing a pointer to the data page and the row number of the corresponding record. In a non-clustered index:

    * The physical order of the rows is not the same as the index order.
    * Non-clustered indexes are typically created on columns used in JOIN, WHERE, and ORDER BY clauses.
    * They are good for tables whose values may be modified frequently.

Microsoft SQL Server creates a non-clustered index by default when the CREATE INDEX command is given. There can be more than one non-clustered index on a database table; SQL Server allows up to 249 non-clustered indexes per table. Also, a clustered index on the primary key is created by default.
Clustered
Clustering alters the data blocks into a distinct order to match the index, with the result that the data rows are stored in that order. Therefore, only one clustered index can be created on a given database table. Clustered indexes can greatly increase overall retrieval speed, but usually only where the data is accessed sequentially, in the same or reverse order of the clustered index, or when a range of items is selected.

Since the physical records are in this sort order on disk, the next row item in the sequence is immediately before or after the previous one, so fewer data-block reads are required. The main feature of a clustered index is therefore that the ordering of the physical data rows matches the index blocks that point to them. Some databases separate the data and index blocks into separate files; others place two completely different data blocks within the same physical file(s). Where the physical ordering of the rows matches the index ordering, the leaf (lowest) level of the clustered index contains the actual data rows.
Under Oracle, such a table is known as an "index-organized table".

Indexing and hashing in Databases

A database index is a data structure that improves the speed of data retrieval operations on a database table at the cost of slower writes and increased storage space. Indexes can be created using one or more columns of a database table, providing the basis for both rapid random lookups and efficient access to ordered records. The disk space required to store the index is usually less than that required by the table (since indexes usually contain only the key fields by which the table is arranged and exclude all the other details in the table), raising the possibility of storing indexes in memory for a table whose data is too large to fit in memory.





In a relational database, an index is a copy of a portion of a table. Some databases extend the power of indexing by allowing indexes to be created on functions or expressions. For example, an index could be created on upper(last_name), which would store only the upper-case versions of the last_name field in the index. Another option sometimes supported is the use of "filtered" indexes, where index entries are created only for those records that satisfy some conditional expression. A further aspect of flexibility is to permit indexing on user-defined functions, as well as on expressions formed from an assortment of built-in functions.

Indexes can be defined as unique or non-unique. A unique index acts as a constraint on the table by preventing duplicate entries in the index, and hence in the table it backs.
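As a minimal sketch of these ideas, the Perl snippet below (assuming the DBI and DBD::SQLite modules are installed; the table, column and index names are made up) creates a small table, a non-unique index on a lookup column, and a unique index that also acts as a constraint.

#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# In-memory SQLite database, just for illustration.
my $dbh = DBI->connect("dbi:SQLite:dbname=:memory:", "", "", { RaiseError => 1 });

$dbh->do("CREATE TABLE person (id INTEGER, last_name TEXT, email TEXT)");

# Non-unique index: speeds up lookups and joins on last_name.
$dbh->do("CREATE INDEX idx_person_last_name ON person (last_name)");

# Unique index: also acts as a constraint, rejecting duplicate e-mail addresses.
$dbh->do("CREATE UNIQUE INDEX idx_person_email ON person (email)");

$dbh->do("INSERT INTO person VALUES (1, 'Pearson', 'kp\@example.org')");

# This second insert violates the unique index and raises an error.
eval { $dbh->do("INSERT INTO person VALUES (2, 'Hotelling', 'kp\@example.org')") };
print "duplicate rejected: $@" if $@;

$dbh->disconnect;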